All Around The World: The Common Crawl Dataset

At watchTowr, we're big believers that data is power, and ultimately data drives security initiatives - like Attack Surface Management, which we then use to power continuous security testing within the watchTowr Platform.

Who remembers the pre-Google days of searching the web with Altavista, Lycos, or other such search engines? Information was hard to find, despite being out there. There was a common adage - "The Internet is like the world's biggest library, except all the books are lying on the floor, unsorted". And then along came Google, and PageRank, and the web suddenly got much more accessible.

But Google isn't designed with security research in mind. Other engines, such as the fantastic Shodan, have stepped up to fill this niche, and are very useful to those seeking a holistic view of their web-based attack surface. But these services have their limits. Wouldn't it be good if you had direct access to exactly the kind of dataset that Google themselves generate search results from? Then you could simply 'query the web' for attack surface exposed by your organisation which you may be unaware of.

Of course, I can't give you access to Google's database, but I can turn you on to the next best thing - the Common Crawl Project. The project, in their own words, "[Builds] and maintains an open repository of web crawl data that can be accessed and analyzed by anyone". Essentially, it is a crawl of the web, archived into huge petabyte-scale data files. It even contains historic data, going back to 2008.

The Common Crawl dataset is exciting - it enables rapid discovery of additional attack surface, giving us huge amounts of data to query, and it helps us identify the exposure of seriously sensitive information.

As we walk through the power of Common Crawl - we'll provide a few examples for each category to demonstrate that we are not totally crazy - it really is very powerful.

Setup And General Use

To be useful for attack surface detection, we need the ability to search the dataset in complex ways, as if we were querying any other SQL dataset. "Just use grep" said an unnamed, inspired and ambitious individual.

But at scale, with this amount of data, running our arbitrary searches and queries at a whim is best done with real tools rather than random string matching - for example, Amazon's 'Athena' tool.

It requires a bit of setup (groan, see here) but it's quite quick and (relatively) painless, even for those not well-versed in Big Data such as myself. In summary:
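Broadly, you create a ccindex database in Athena, define an external table that points at Common Crawl's columnar index on S3, and then load the partitions. A heavily abbreviated sketch follows - the full CREATE TABLE statement, with the complete column list, lives in Common Crawl's cc-index-table repository:

CREATE DATABASE ccindex;

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                 STRING,
  url                         STRING,
  url_host_name               STRING,
  url_host_registered_domain  STRING,
  fetch_status                SMALLINT,
  content_digest              STRING,
  content_mime_type           STRING,
  content_mime_detected       STRING,
  warc_filename               STRING,
  warc_record_offset          INT,
  warc_record_length          INT
  -- ... many further columns elided ...
)
PARTITIONED BY (crawl STRING, subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

-- Pick up the per-crawl partitions (re-run when a new crawl is published)
MSCK REPAIR TABLE ccindex;
An abbreviated setup for querying the Common Crawl columnar index via Athena.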

This results in a dataset you can query easily. There are a bunch of example queries on the page linked above which illustrate the power of the dataset; the following examples serve to show how useful it is within Attack Surface Management.

Attack Surface Discovery

If we throw ourselves into imagination land - let's imagine an (incredibly plausible) scenario in which a lone developer is running a site detrimental to the organisation's security posture (or image), unbeknownst to the wider IT staff.

A mature attack-surface management team should be able to locate this site so that it can be secured and managed, along with all the other IT assets. This sounds trivial when your attack surface is 10 machines - but it is not a simple task when done at scale across hundreds of thousands of assets, and it requires that we use all the data and tools at our disposal to obtain as complete a picture as possible.

Our first three examples will focus on this problem - discovery of 'shadow IT' assets, undocumented and unmanaged, which can otherwise damage a brand or a business.

Attack Surface Discovery: Locating Subdomains

Since the Common Crawl index records the hostname of every page it fetches, it is very easy to search for any domains it has spidered that reference your organisation by name. Doing so is a quick way to discover additional attack surface, fueling our thirst for complete attack surface visibility.

A query to find this information is simple enough to be self-explanatory:

SELECT distinct(url_host_name)
FROM "ccindex"."ccindex"
WHERE crawl like 'CC-MAIN-2022-33'
and "url_host_name" like '%microsoft%'
Query for hostnames containing the word 'microsoft'.

Our example uses Microsoft, who apparently have a large number of resources, demonstrating the dataset's ability to surface a wealth of information. Note that the query can be made more specific - to find (for example) subdomains, or to search by a complex pattern:

SELECT distinct(url_host_name)
FROM "ccindex"."ccindex"
WHERE crawl like 'CC-MAIN-2022-33'
and regexp_like("url_host_name", '.*\.microsoft\.(sg|com|org|co\.uk)')
Query for hostnames containing the word 'microsoft', and ending in specific TLDs.

In this simple example, we find attack surface under the main organisation name, suffixed with various TLDs. This can be cross-referenced against a known asset list in order to discover new attack surface.
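That cross-referencing step is easily scripted. A rough sketch (file names are illustrative, and we assume the Athena results have been exported as a single-column CSV of hostnames):

known = set(line.strip().lower() for line in open("known_assets.txt"))

with open("athena_hostnames.csv") as f:
	next(f)  # skip Athena's CSV header row
	discovered = set(line.strip().strip('"').lower() for line in f)

# Anything crawled under our name but absent from the asset register is a
# candidate piece of 'shadow IT'
for host in sorted(discovered - known):
	print(host)
A sketch of cross-referencing discovered hostnames against a known asset list.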

Attack Surface Discovery: Locating Resources With Specific Content

Of course, life is not so easy - not all such assets can be discovered via a simple domain-name query.

Another approach we find quite effective is to search for sites that host your organisation's logo - while noisy, this is often remarkably good at finding assets that have been fully outsourced to third parties (*cough* that random web development agency you used once *cough*).

While existing search engines can perform a barely-controllable, fallible 'reverse image search', we prefer to use the Common Crawl dataset, since it is more friendly to programmatic access, and generalises easily beyond image data to any other HTTP resource. And because it's just more fun.

As a quick demonstration, let's look for anywhere that hosts Google's logo.

Since the Common Crawl dataset indexes resources by base32-encoded sha1 hash, our first step is to find this hash:

$ curl https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png | openssl dgst -binary -sha1 | base32
62PCMRHDCFS33FJRDQXPNVLDZ7VRXTAT
A command to obtain the base32-encoded SHA1 hash of 'googlelogo_color_92x30dp.png'.

Given this hash, 62PCMRHDCFS33FJRDQXPNVLDZ7VRXTAT, we can easily query for instances.

SELECT *
FROM "ccindex"."ccindex"
WHERE crawl like 'CC-MAIN-2022-33'
and content_digest = '62PCMRHDCFS33FJRDQXPNVLDZ7VRXTAT'
A query to find all URLs hosting the Google logo.

The results show all URLs which mirror this resource.

This approach will generalise to any kind of resource - for example, you could search for a given source file in order to locate leaks of proprietary information, using much the same technique.
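As a quick sketch of that idea (the file name here is illustrative), hash the file you care about locally:

$ openssl dgst -binary -sha1 proprietary_module.c | base32
A command to obtain the base32-encoded SHA1 hash of a local file.

The resulting hash then simply replaces the Google-logo hash in the content_digest query above.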

Attack Surface Discovery: Building A Wordlist

A well-known technique, regularly used to discover additional subdomains exposed by an organisation, is subdomain brute-forcing.

In this process, we use a dictionary to generate DNS queries, and note any that successfully resolve, as they usually represent additional client attack surface. While this is a powerful technique, it is somewhat hindered by the frequent use of domain-specific terminology - for example, a company that manufactures plastics may have domains that use specialised vocabulary related to polymers which are not found in (any) wordlists.
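For completeness, the resolution step itself is only a few lines. A minimal sketch, with an illustrative target domain and wordlist path:

import socket

domain = "example.com"
for word in open("wordlist.txt"):
	word = word.strip()
	if not word:
		continue
	candidate = f"{word}.{domain}"
	try:
		# Any name that resolves is worth recording as attack surface
		socket.getaddrinfo(candidate, None)
		print(f"Resolves: {candidate}")
	except socket.gaierror:
		pass
A minimal subdomain brute-force loop.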

Obviously, the success of this approach is largely dictated by the quality and relevance of the wordlist involved. One way to improve the quality of a wordlist is by spidering a client site to gather vocabulary, and the Common Crawl corpus can assist in this, supplying a snapshot of the client's website for us to easily ingest.

The code to do so is straightforward - my solution (see the watchTowr GitHub repo) came to around 110 lines of Python, including whitespace.

First up, we query Athena, and wait for the query to complete. There's a lot of boilerplate here but the essence is simple - run a query to fetch all pages in a given domain, and place the results in an s3 bucket.

	queryStart = self.client.start_query_execution(
		QueryString=
			f" select warc_filename, warc_record_offset, warc_record_length "
			f" FROM ccindex.ccindex "
			f" WHERE crawl = '{self.crawl}'"
			f" and url_host_registered_domain = '{self.domainName}'",
		QueryExecutionContext={
			'Database': 'ccindex'
		},
		ResultConfiguration={
			'OutputLocation': f's3://{self.s3bucket}/wordlist'
		}
	)
	QueryExecutionId = queryStart['QueryExecutionId']

	delay = 1
	while True:
		queryExecution = self.client.get_query_execution(QueryExecutionId = QueryExecutionId)
		state = queryExecution['QueryExecution']['Status']['State']
		if state in ("QUEUED", "RUNNING"):
			if delay < 60:
				delay = delay * 3
			time.sleep(delay)
			continue
		break
	if state != "SUCCEEDED":
		raise Exception(f"Query did not succeed, it is in state '{state}'")

Once the query is complete, we open the output file in s3, and find the values we need to fetch the file itself (I've omitted some details here for clarity).

	with BytesIO() as f:
		self.s3.download_fileobj(self.s3bucket, f'wordlist/{QueryExecutionId}.csv', f)
		f.seek(0)
		for line in f.readlines():
			# Athena quotes each CSV field; skipping its header row is omitted here for clarity
			warc_filename, warc_record_offset, warc_record_length = line.decode("ascii").strip().replace('"', '').split(",")
			# The offset and length must be integers for the Range arithmetic below
			warc_record_offset = int(warc_record_offset)
			warc_record_length = int(warc_record_length)

Once we have these values, we use the requests module to fetch the compressed resource into memory. Note the use of the Range header to select the correct data.

	with requests.request("GET",
			f"https://data.commoncrawl.org/{warc_filename}",
			headers={'Range': f"bytes={warc_record_offset}-{warc_record_offset+warc_record_length-1}"},
			stream=True) as req:
		req.raise_for_status()
		# Read compressed data from the HTTP stream into memory
		s = req.raw.stream(1024, decode_content=False)
		gz = BytesIO()
		for chunk in s:
			gz.write(chunk)

And then we can uncompress the data.

	# And decompress the gzip'ped data.
	gz.seek(0)
	with gzip.GzipFile(fileobj=gz) as ungz:
		while True:
			chunk = ungz.read()
			if chunk is None or chunk == b'':
				break

Finally, we can do as we see fit with the document, splitting it into words and keeping a count.

class wordlist:
	def __init__(self):
		self.words = {}

	def addDocument(self, document):
		# Remove HTML tags
		documentParsed = BeautifulSoup(document, "html.parser")
		documentContents = documentParsed.get_text()

		# We'll split on any of these characters
		for delim in " \r@{}[]()<>,.='\"&;:/\\%":
			documentContents = documentContents.replace(delim, "\n")

		for word in documentContents.split():
			self.words[word] = self.words.get(word, 0) + 1

words.addDocument(chunk.decode("ascii", errors='ignore'))
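To report on what we've gathered, we add a simple print method to the wordlist class. A minimal sketch - the method name is illustrative, and the full implementation lives in the watchTowr GitHub repo:

	def printCounts(self):
		# Print each word we've seen, least-frequently-seen first
		for word, count in sorted(self.words.items(), key=lambda item: item[1]):
			print(f"Seen {count} time(s): '{word}'")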

Running this against an organisation's site shows the words we've gathered, and how popular each one is:

.. snip .. 
Seen 1 time(s): 'holistic'
Seen 1 time(s): 'approach'
Seen 1 time(s): 'provides'
Seen 1 time(s): 'paired'
Seen 1 time(s): 'empowering'
Seen 1 time(s): 'capabilities'
.. snip .. 

As you can see, words relevant to the organisation have been located.

The full code is available at the watchTowr GitHub repo.

Assessing Attack Surface

Once we find attack surface, the Common Crawl dataset is still useful to us - turning data into identified exposure of potentially sensitive information. Here are a few examples to show how the dataset can assist in this area.

Assessment: Finding Content By Type

For many organisations, a database leak is a 'worst nightmare' scenario, in terms of reputation, compliance, and regulation. These leaks are often accidental, caused by an inadvertently-accessible backup, for example. The dataset can help here since it can be searched by mime-type (as detected by the crawler, or as declared by the web server), and help us locate such leaks and remediate them before an adversary has a chance to take advantage of them.

Given a portion of attack surface (for example, a domain name), we can search the corpus for files which are detected to contain SQL code, as a database dump created with the mysqldump tool would:

SELECT *
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2022-27'
and fetch_status = 200
and "url" like '%.sql'
and "url_host_name = '<your domains>'
and subset != 'robotstxt'
and "content_mime_detected" = 'text/x-sql'
order by "warc_record_length" desc

This is a powerful query, which reveals over 1500 results when run over all domains in the Common Crawl dataset. The exposure of these files to the wider Internet is often disastrous in terms of reputation, legality, and lost business.

Assessment: Finding Misidentified Content

Another common failure mode is a misconfigured web server serving PHP files as plain text, exposing their contents. The dataset can help find this scenario:

SELECT url
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2022-27'
and fetch_status=200
and content_mime_type != 'text/x-php'
and content_mime_detected = 'text/x-php'

This becomes much more useful when combined with a URL filter. Here, we look for files named config.php, which usually contain sensitive credentials. A simple example, but nonetheless relevant.

SELECT url
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2022-27'
and fetch_status=200
and url_path like '%config.php%'
and content_mime_type != 'text/x-php'
and content_mime_detected = 'text/x-php'

This query is particularly effective, locating hundreds of credentials across the wider web. The impact of this should be relatively obvious - but, tl;dr: "this is not fantastic".

Limitations

As always, life isn't perfect and nothing is without its faults - there are a few limitations I bumped into during my use of the Common Crawl index for nefarious purposes.

Firstly, obviously, querying such a resource is a computationally expensive task, and isn't free in terms of currency. I'm always wary of running up a large Athena bill by accident, although this hasn't happened yet (as of publishing). It should be noted that heavy users of the dataset can simply download the indexes and query some of the data locally.
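For those who go down the local route, the columnar index shards referenced by the Athena table (under s3://commoncrawl/cc-index/table/cc-main/warc/) can be pulled down and queried with ordinary tooling. A rough sketch, assuming you've already downloaded one (sizeable) shard - the local file name is illustrative:

import pandas as pd

# Load just the columns we care about from a single downloaded index shard
df = pd.read_parquet(
	"ccindex-shard.parquet",
	columns=["url_host_name", "url", "content_mime_detected"])

# The same style of filtering as the Athena queries, just done locally
hits = df[df["url_host_name"].str.contains("microsoft", na=False)]
print(hits["url_host_name"].unique())
A sketch of querying a downloaded index shard locally.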

Secondly, and perhaps most importantly, the coverage of the crawl is limited. While the Common Crawl project is a remarkable feat of engineering and computer science, it is obviously not able to snapshot the entire web on a continual basis.

Finally, the crawler that obtains data for the Common Crawl project is 'well-behaved', and observes such limitations as those requested in robots.txt. Boo!

Conclusion

I hope this post has left you feeling curious and inspired, as I was when I first discovered the Common Crawl project over ten years ago. I really do feel that the infosec community hasn't truly noticed this project beyond superficial uses, and thus doesn't pay it as much attention as it deserves.

The queries we've used as examples are available at the watchTowr GitHub repository - please feel free to try them out and tinker with them! It is my hope that readers will find interesting new ways to use the dataset.

If you do, I'd love to hear about it - either in the comments below, on Twitter (@watchtowrcyber), or just via email (aliz@watchtowr.com). Particularly interesting use-cases will receive a watchTowr t-shirt, and hundreds of thousands of internet points (we promise).

At watchTowr, we believe continuous security testing is the future, enabling the rapid identification of holistic high-impact vulnerabilities that affect your organisation.

If you'd like to learn more about how the watchTowr Platform, our Attack Surface Management and Continuous Automated Red Teaming solution, can support your organisation, please get in touch.