accessing web archives
Creating a panel dataset of historical website information from web archives.
This blog post shows how to create dynamic panel datasets of company information from archived web data. The proposed framework uses web data archived by Common Crawl and the Internet Archive. For illustration purposes, a web panel of the DAX 30 companies comprising more than 10 years of web data is created.
Why corporate websites?
Company websites are an important source of economic data: firms use them to spread product and service information (establishing a public image), to conduct transactions (e-business processes) and to facilitate opinion sharing (electronic word-of-mouth) (Blazquez & Domenech, 2018). Recent economic studies have used corporate website data to:
- predict firm innovativeness (Gök et al., 2015; Kinne & Lenz, 2021; Axenbeck & Breithaupt, 2021)
- examine market entry strategies (Arora et al., 2013)
- examine enterprise growth (Li et al., 2016)
- monitor firm export orientation (Blazquez & Domenech, 2017)
- track crisis impacts on the corporate sector (Dörr et al., 2022)
- ...
Why panel data?
Firm characteristics, diffusion processes such as technological advances and adoption, and business relations are clearly not static but evolve over time. Capturing this information requires continuous monitoring of corporate websites. Moreover, most research studies require long time spans to derive meaningful insights and to control for unobserved heterogeneity across firms. For this reason, this post shows how to create a panel dataset of unstructured firm website information using Common Crawl. Common Crawl is an open repository of web data crawled over time from the universe of websites. The Common Crawl corpus contains 'petabytes of [...] raw web page data, metadata extracts and text extracts' and is freely accessible to anybody. In the following, we show how to retrieve mostly textual website information of a company at different points in time given its web address. For this purpose, we use cdx_toolkit, a Python module for querying the Common Crawl corpus.
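All modules used below are freely available; as a minimal setup sketch (assuming a standard Python environment and the package names as published on PyPI), they can be installed with pip:
pip install cdx_toolkit beautifulsoup4 pandas tqdm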
Retrieve historical company website information
First, we load the required Python modules:
import pandas as pd
import cdx_toolkit
Next, for demonstration purposes we create a sample list of corporate URLs:
urls = ['zew.de/*', 'sap.com/*', 'otaro-shop.com/*']  # 'zew.de/*' returns captures for any page on zew.de, e.g. zew.de/presse; remove the * to restrict captures to zew.de only
Instantiate the CDXFetcher(), which allows us to search Common Crawl's corpus for entries of a given corporate URL. Note that cdx_toolkit also allows querying the Internet Archive, a collection of archived web data that reaches even further back in time than Common Crawl (more on this later).
client = cdx_toolkit.CDXFetcher(source='cc') # cc stands for Common Crawl
Now we can search Common Crawl's corpus for archived web data of the respective company websites.
In a first step, we only extract the metadata, including information such as the time of crawling (timestamp), the nature and format of the crawled document (mime), the HTTP status code (status), and the position (offset) and size in bytes (length) of the record within the underlying archive file. We restrict the query to:
- the year 2017
- successful crawling attempts (HTTP status = 200)
- text data only (mime = text/html)
- at most 10 web pages (limit = 10).
requests = list(client.iter(urls[0], from_ts='201701', to='201712', limit=10, verbose='v', filter=['=status:200', '=mime-detected:text/html']))
metadata = pd.DataFrame([r.data for r in requests])
metadata.head(3)
urlkey | timestamp | url | mime | mime-detected | status | digest | length | offset | filename |
---|---|---|---|---|---|---|---|---|---|
de,zew)/ | 20171211154120 | http://www.zew.de/ | text/html | text/html | 200 | DFB4ZBE4QT4Y7CWATCK7Q7G5SSCBQHY6 | 18082 | 508576778 | crawl-data/CC-MAIN-2017-51/segments/1512948513... |
de,zew)/ | 20171212155354 | http://www.zew.de/ | text/html | text/html | 200 | OXTU2HOXOSJ2YBYRPFDIED77WJ543HBS | 17904 | 495470327 | crawl-data/CC-MAIN-2017-51/segments/1512948517... |
de,zew)/das-zew/aktuelles/gruendungen-in-baden... | 20171212095717 | http://www.zew.de/das-zew/aktuelles/gruendunge... | text/html | text/html | 200 | EDPWOZEBDOJ7L2A4TQ2JACYA4FXP3XKG | 14557 | 491201284 | crawl-data/CC-MAIN-2017-51/segments/1512948515... |
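The timestamp column encodes the crawl time as YYYYMMDDhhmmss. As a small sketch of our own (the column name crawl_time is arbitrary), it can be parsed into a proper datetime, which is convenient when assigning captures to panel periods later on:
# parse the 14-digit CDX timestamp into a pandas datetime
metadata['crawl_time'] = pd.to_datetime(metadata['timestamp'], format='%Y%m%d%H%M%S')
metadata[['timestamp', 'crawl_time']].head(3)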
As can be seen, the metadata provides an overview of what sort of archived web data is available for the given URL. Most importantly, note that only websites successfully archived in 2017 and containing textual content have been requested. We have not yet downloaded the actual content found on the websites but only inspected the respective metadata. In the next step, we will download the textual content found on the respective web pages at the given points in time.
For this purpose, we import BeautifulSoup, which makes it easy to parse the HTML code of the web pages.
from bs4 import BeautifulSoup
textdata = pd.DataFrame([BeautifulSoup(r.content, 'html.parser').get_text(strip=True) for r in requests], columns=['text'])
pd.concat([metadata, textdata], axis=1)[['timestamp', 'text']]
timestamp | text |
---|---|
20171211154120 | Zentrum für Europäische Wirtschaftsforschung (... |
20171212155354 | Zentrum für Europäische Wirtschaftsforschung (... |
20171212095717 | ZEW-Aktuell: Gründungen in Baden-Württemberg m... |
20171212100700 | ZEW-Aktuell: Neue Daten für eine effiziente Ve... |
20171211154059 | ZEW-Aktuell: Schüler-Teams zum regionalen YES!... |
20171213020528 | ZEW-Aktuell: ZEW Wirtschaftsforum 2016 – Markt... |
20171212102529 | 404Zur Navigation springenZum Seiteninhalt spr... |
20171217114955 | AnfahrtZur Navigation springenZum Seiteninhalt... |
20171215102245 | Aktuelle Meldungen - ZEW MannheimZur Navigatio... |
20171214063116 | ZEW-Aktuell: 19. ZEW Summer Workshop – Gestalt... |
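Note that the digest column in the metadata fingerprints the page content, so byte-identical captures can be removed with a simple deduplication step; a sketch of our own (the name panel_2017 is arbitrary):
# combine metadata and extracted text, then keep one row per unique content digest
panel_2017 = pd.concat([metadata, textdata], axis=1)
panel_2017 = panel_2017.drop_duplicates(subset='digest')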
We see that with the framework introduced here it is convenient to extract the textual content found on corporate websites for a given timestamp. In the example, we have extracted web texts of ZEW for the year 2017. With Common Crawl one can go back as far as 2008 when extracting the web content of a company. One can also be more precise about the date: we have restricted the search to the year 2017, but one could just as well extract web data archived in a specific month or even on a specific day (provided the website was archived on that day) - the internet does not forget. In this way, it becomes easy to obtain historical website information of any corporation.
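As a sketch of such a narrower query (the month chosen here and the variable name requests_june are our own), restricting the search to a single month only requires tightening the from_ts and to arguments:
# restrict the query to web pages archived in June 2017
requests_june = list(client.iter(urls[0], from_ts='201706', to='201706', limit=10, verbose='v', filter=['=status:200', '=mime-detected:text/html']))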
With this approach it is also easy to extend the panel even further back in time. For this purpose, we will use web data archived in another web archive, the Internet Archive.
client = cdx_toolkit.CDXFetcher(source='ia') # "ia" stands for Internet Archive
from tqdm import tqdm # allows to monitor progress of the execution
df = pd.DataFrame()
for y in tqdm(range(2010, 2023)):
    # query up to 10 successfully archived HTML pages of zew.de per year
    requests = list(client.iter(urls[0], from_ts=str(y), to=str(y), limit=10, verbose='v', filter=['status:200', 'mime:text/html']))
    # combine the capture metadata with the extracted page text
    df_temp = pd.concat([pd.DataFrame([r.data for r in requests]), pd.DataFrame([BeautifulSoup(r.content, 'html.parser').get_text(strip=True) for r in requests], columns=['text'])], axis=1)
    df_temp['year'] = df_temp.timestamp.apply(lambda x: str(x)[0:4])
    df_temp = df_temp[['year', 'text']]
    # append the yearly snapshots to the panel
    df = pd.concat([df, df_temp], ignore_index=True)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:24<00:00, 12.11s/it]
If we take a look at the data, we see that we now have up to 10 additional snapshots per year of ZEW's website for the years 2010 to 2022.
df
year | text |
---|---|
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
... | ... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
Overall, with the proposed framework we could extend the panel for ZEW, which now comprises web data from 2010 to 2022. A coverage of 13 years is much better suited for economic research that, for instance, studies the diffusion of technologies over time.
Naturally, we can scale up the above approach to retrieve web content for a larger number of firms. Let us do so for the DAX 30 companies.
from tqdm import tqdm
from bs4 import BeautifulSoup
import requests
import re
First, we collect the URLs of all DAX 30 companies. For convenience, we scrape them with the following lines of code.
landing = 'https://disfold.com/top-companies-germany-dax/'
page = requests.get(landing)
html = BeautifulSoup(page.content, 'html.parser')
# collect all external links listed on the page
dax30 = html.find('div', class_='entry-content').find_all('a', target='_blank', href=True)
# extract the bare domain from each link and append a wildcard for the CDX query
urls = [re.search(r'\..{1,}\.(?:com|de|rwe)', d['href']).group(0)[1:] + '/*' for d in dax30]
Let us take a quick glance at the URLs of all DAX 30 companies.
urls
[ 'thyssenkrupp.com/*', 'covestro.com/*', 'lufthansa.com/*', 'merckgroup.com/*', 'wirecard.com/*', 'heidelbergcement.com/*', 'group.rwe/*', 'db.com/*', 'deutsche-boerse.com/*', 'eon.com/*', 'freseniusmedicalcare.com/*', 'infineon.com/*', 'vonovia.de/*', 'beiersdorf.com/*', 'fresenius.com/*', 'continental-corporation.com/*', 'munichre.com/*', 'henkel.com/*', 'dpdhl.com/*', 'adidas-group.com/*', 'bmw.com/*', 'bayer.com/*', 'daimler.com/*', 'basf.com/*', 'telekom.com/*', 'volkswagenag.com/*', 'siemens.com/*', 'allianz.com/*', 'lindeplc.com/*', 'sap.com/*' ]
Given the URLs, we can now construct a web data panel for the DAX 30 companies by applying the framework introduced above.
df = pd.DataFrame()
for c in tqdm(urls):
    for y in range(2010, 2022):
        # query up to 10 successfully archived HTML pages per company and year
        requests = list(client.iter(c, from_ts=str(y), to=str(y), limit=10, verbose='v', filter=['status:200', 'mime:text/html']))
        df_temp = pd.concat([pd.DataFrame([r.data for r in requests]), pd.DataFrame([BeautifulSoup(r.content, 'html.parser').get_text(strip=True) for r in requests], columns=['text'])], axis=1)
        if not df_temp.empty:
            df_temp['year'] = df_temp.timestamp.apply(lambda x: str(x)[0:4])
            df_temp['id'] = c
            df_temp = df_temp[['id', 'year', 'text']]
            # append the company-year snapshots to the panel
            df = pd.concat([df, df_temp], ignore_index=True)
The table below shows how many distinct web pages could be retrieved for each of the 30 DAX companies in each year with the proposed framework. Note that we have set the depth limit to 10 distinct web pages per company and year; typically, for large companies such as publicly listed firms, many more web pages are archived (see the size-estimate sketch after the table).
df.rename(columns={'id': 'DAX 30 company', 'year': 'Year', 'text': 'Number of web pages'}).pivot_table(index='DAX 30 company', columns='Year', aggfunc=len, fill_value=0)
Year | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DAX 30 company | Number of web pages | |||||||||||
adidas-group.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
allianz.com/* | 10 | 10 | 10 | 9 | 10 | 10 | 1 | 10 | 10 | 10 | 10 | 10 |
basf.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
bayer.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
beiersdorf.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
bmw.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
continental-corporation.com/* | 0 | 5 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 0 | 0 |
covestro.com/* | 7 | 0 | 0 | 1 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
daimler.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
db.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
deutsche-boerse.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
dpdhl.com/* | 10 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
eon.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
fresenius.com/* | 5 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
freseniusmedicalcare.com/* | 0 | 2 | 0 | 2 | 1 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
group.rwe/* | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 |
heidelbergcement.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
henkel.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
infineon.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
lindeplc.com/* | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 1 | 0 | 0 |
lufthansa.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
merckgroup.com/* | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
munichre.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
sap.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
siemens.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
telekom.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
thyssenkrupp.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
volkswagenag.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
vonovia.de/* | 0 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
wirecard.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
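To gauge how much more material lies beyond this depth limit, cdx_toolkit offers a rough estimate of the number of captures for a URL pattern; a minimal sketch of our own against Common Crawl (assuming the get_size_estimate method behaves as documented in the cdx_toolkit README):
# rough estimate of the total number of captures available for a URL pattern
cc_client = cdx_toolkit.CDXFetcher(source='cc')
print(cc_client.get_size_estimate('zew.de/*'))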
Overall, we could build a web panel of the DAX 30 companies spanning the last 12 years. The size of the web panel in memory amounts to:
import sys
# rough in-memory size of the DataFrame in megabytes
print(str(sys.getsizeof(df)/1000000) + ' MB')
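To keep the panel for later analysis, it can be written to disk; a minimal sketch (the file name is our own choice):
# persist the web panel; CSV is chosen here for simplicity
df.to_csv('dax30_web_panel.csv', index=False)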
Conclusion
This post has shown how to create a panel of web data using websites archived by Common Crawl and the Internet Archive. Web content changes over time, and without the preservation efforts of web archives this information would be lost. Reassembling past information from the web in a structured way makes it possible to analyze dynamics and changing themes on the world wide web.