accessing web archives
Creating a panel dataset of historical website information from web archives.
This blog post shows how to create dynamic panel datasets of company information from archived web data. The proposed framework uses web data archived by Common Crawl and the Internet Archive. For illustration purposes, a web panel of the DAX 30 companies comprising more than 10 years of web data is created.
Why corporate websites?
Company websites are an important source of economic data: firms use them to spread product and service information (establishing a public image), to conduct transactions (e-business processes) and to facilitate opinion sharing (electronic word-of-mouth) (Blazquez & Domenech, 2018). Recent economic studies have used corporate website data to:
- predict firm innovativeness (Gök et al., 2015; Kinne & Lenz, 2021; Axenbeck & Breithaupt, 2021)
- examine market entry strategies (Arora et al., 2013)
- examine enterprise growth (Li et al., 2016)
- monitor firm export orientation (Blazquez & Domenech, 2017)
- track crisis impacts on the corporate sector (Dörr et al., 2022)
- ...
Why panel data?
Firm characteristics, diffusion processes such as technological advances and adoption, and business relations are clearly not static but evolve over time. Capturing this information requires continuous monitoring of corporate websites. Moreover, most research studies require long time spans to derive meaningful insights and to control for unobserved heterogeneity across firms. For this reason, this post shows how to create a panel dataset of unstructured firm website information using Common Crawl. Common Crawl is an open repository of web data crawled over time from the universe of websites. The Common Crawl corpus contains 'petabytes of [...] raw web page data, metadata extracts and text extracts' and is freely accessible to anybody. In the following, we show how to retrieve mostly textual website information of a company at different points in time given its web address. For this purpose, we use cdx_toolkit, a Python module for querying the Common Crawl corpus.
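All modules used below are freely available; as a minimal setup sketch (assuming a standard Python environment and the package names as published on PyPI), they can be installed with pip:
pip install cdx_toolkit beautifulsoup4 pandas tqdm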
Retrieve historical company website information
First, we load the required Python modules:
import pandas as pd
import cdx_toolkit
Next, for demonstration purposes we create a sample list of corporate URLs:
urls = ['zew.de/*', 'sap.com/*', 'otaro-shop.com/*']  # 'zew.de/*' returns captures for any page on zew.de, e.g. zew.de/presse; remove the * to restrict captures to zew.de only
Instantiate the CDXFetcher(), which allows us to search Common Crawl's corpus for entries of a given corporate URL. Note that cdx_toolkit also allows querying the Internet Archive, a collection of archived web data that reaches even further back in time than Common Crawl (more on this later).
client = cdx_toolkit.CDXFetcher(source='cc') # cc stands for Common Crawl
Now we can search Common Crawl's corpus for archived web data of the respective company websites.
In a first step, we only extract the metadata, including information such as the time of crawling (timestamp), the nature and format of the crawled document (mime), the HTTP status code (status), and the position (offset) and size in bytes (length) of the record within the underlying archive file. We restrict the query to:
- the year 2017
- successful crawling attempts (HTTP status = 200)
- text data only (mime = text/html)
- at most 10 web pages (limit = 10).
requests = list(client.iter(urls[0], from_ts='201701', to='201712', limit=10, verbose='v', filter=['=status:200', '=mime-detected:text/html']))
metadata = pd.DataFrame([r.data for r in requests])
metadata.head(3)
urlkey | timestamp | url | mime | mime-detected | status | digest | length | offset | filename |
---|---|---|---|---|---|---|---|---|---|
de,zew)/ | 20171211154120 | http://www.zew.de/ | text/html | text/html | 200 | DFB4ZBE4QT4Y7CWATCK7Q7G5SSCBQHY6 | 18082 | 508576778 | crawl-data/CC-MAIN-2017-51/segments/1512948513... |
de,zew)/ | 20171212155354 | http://www.zew.de/ | text/html | text/html | 200 | OXTU2HOXOSJ2YBYRPFDIED77WJ543HBS | 17904 | 495470327 | crawl-data/CC-MAIN-2017-51/segments/1512948517... |
de,zew)/das-zew/aktuelles/gruendungen-in-baden... | 20171212095717 | http://www.zew.de/das-zew/aktuelles/gruendunge... | text/html | text/html | 200 | EDPWOZEBDOJ7L2A4TQ2JACYA4FXP3XKG | 14557 | 491201284 | crawl-data/CC-MAIN-2017-51/segments/1512948515... |
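The timestamp column encodes the crawl time as YYYYMMDDhhmmss. As a small sketch of our own (the column name crawl_time is arbitrary), it can be parsed into a proper datetime, which is convenient when assigning captures to panel periods later on:
# parse the 14-digit CDX timestamp into a pandas datetime
metadata['crawl_time'] = pd.to_datetime(metadata['timestamp'], format='%Y%m%d%H%M%S')
metadata[['timestamp', 'crawl_time']].head(3)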
As can be seen, the metadata provides an overview of what sort of archived web data is available for the given URL. Most importantly, note that only websites successfully archived in 2017 and containing textual content have been requested. We have not yet downloaded the actual content found on the websites but only inspected the respective metadata. In the next step, we will download the textual content found on the respective web pages at the given points in time.
For this purpose, we import BeautifulSoup, which makes it easy to parse the HTML code of the web pages.
from bs4 import BeautifulSoup
textdata = pd.DataFrame([BeautifulSoup(r.content, 'html.parser').get_text(strip=True) for r in requests], columns=['text'])
pd.concat([metadata, textdata], axis=1)[['timestamp', 'text']]
timestamp | text |
---|---|
20171211154120 | Zentrum für Europäische Wirtschaftsforschung (... |
20171212155354 | Zentrum für Europäische Wirtschaftsforschung (... |
20171212095717 | ZEW-Aktuell: Gründungen in Baden-Württemberg m... |
20171212100700 | ZEW-Aktuell: Neue Daten für eine effiziente Ve... |
20171211154059 | ZEW-Aktuell: Schüler-Teams zum regionalen YES!... |
20171213020528 | ZEW-Aktuell: ZEW Wirtschaftsforum 2016 – Markt... |
20171212102529 | 404Zur Navigation springenZum Seiteninhalt spr... |
20171217114955 | AnfahrtZur Navigation springenZum Seiteninhalt... |
20171215102245 | Aktuelle Meldungen - ZEW MannheimZur Navigatio... |
20171214063116 | ZEW-Aktuell: 19. ZEW Summer Workshop – Gestalt... |
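Note that the digest column in the metadata fingerprints the page content, so byte-identical captures can be removed with a simple deduplication step; a sketch of our own (the name panel_2017 is arbitrary):
# combine metadata and extracted text, then keep one row per unique content digest
panel_2017 = pd.concat([metadata, textdata], axis=1)
panel_2017 = panel_2017.drop_duplicates(subset='digest')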
We see that with the framework introduced here it is convenient to extract the textual content found on corporate websites for a given timestamp. In the example, we have extracted web texts of ZEW for the year 2017. With Common Crawl one can go back as far as 2008 when extracting the web content of a company. One can also be more precise about the date: we have restricted the search to the year 2017, but one could just as well extract web data archived in a specific month or even on a specific day (provided the website was archived on that day) - the internet does not forget. In this way, it becomes easy to obtain historical website information of any corporation.
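As a sketch of such a narrower query (the month chosen here and the variable name requests_june are our own), restricting the search to a single month only requires tightening the from_ts and to arguments:
# restrict the query to web pages archived in June 2017
requests_june = list(client.iter(urls[0], from_ts='201706', to='201706', limit=10, verbose='v', filter=['=status:200', '=mime-detected:text/html']))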
With this approach it is also easy to extend the panel even further back in time. For this purpose, we will use web data archived in another web archive, the Internet Archive.
client = cdx_toolkit.CDXFetcher(source='ia') # "ia" stands for Internet Archive
from tqdm import tqdm # allows to monitor progress of the execution
df = pd.DataFrame()
for y in tqdm(range(2010, 2023)):
    # query up to 10 successfully archived HTML pages of zew.de per year
    requests = list(client.iter(urls[0], from_ts=str(y), to=str(y), limit=10, verbose='v', filter=['status:200', 'mime:text/html']))
    # combine the capture metadata with the extracted page text
    df_temp = pd.concat([pd.DataFrame([r.data for r in requests]), pd.DataFrame([BeautifulSoup(r.content, 'html.parser').get_text(strip=True) for r in requests], columns=['text'])], axis=1)
    df_temp['year'] = df_temp.timestamp.apply(lambda x: str(x)[0:4])
    df_temp = df_temp[['year', 'text']]
    # append the yearly snapshots to the panel
    df = pd.concat([df, df_temp], ignore_index=True)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:24<00:00, 12.11s/it]
If we take a look at the data, we see that we now have up to 10 additional snapshots per year of ZEW's website for the years 2010 to 2022.
df
year | text |
---|---|
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
2010 | Zentrum für Europäische Wirtschaftsforschung G... |
... | ... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
2022 | ZEW - Zentrum für Europäische Wirtschaftsforsc... |
Overall, with the proposed framework we could extend the panel for ZEW, which now comprises web data from 2010 to 2022. A coverage of 13 years is much better suited for economic research that, for instance, studies the diffusion of technologies over time.
Naturally, we can scale up the above approach to retrieve web content for a larger number of firms. Let us do so for the DAX 30 companies.
from tqdm import tqdm
from bs4 import BeautifulSoup
import requests
import re
First, we collect the URLs of all DAX 30 companies. For convenience, we scrape them with the following lines of code.
landing = 'https://disfold.com/top-companies-germany-dax/'
page = requests.get(landing)
html = BeautifulSoup(page.content, 'html.parser')
# collect all external links listed on the page
dax30 = html.find('div', class_='entry-content').find_all('a', target='_blank', href=True)
# extract the bare domain from each link and append a wildcard for the CDX query
urls = [re.search(r'\..{1,}\.(?:com|de|rwe)', d['href']).group(0)[1:] + '/*' for d in dax30]
Let us take a quick glance at the URLs of all DAX 30 companies.
urls
[ 'thyssenkrupp.com/*', 'covestro.com/*', 'lufthansa.com/*', 'merckgroup.com/*', 'wirecard.com/*', 'heidelbergcement.com/*', 'group.rwe/*', 'db.com/*', 'deutsche-boerse.com/*', 'eon.com/*', 'freseniusmedicalcare.com/*', 'infineon.com/*', 'vonovia.de/*', 'beiersdorf.com/*', 'fresenius.com/*', 'continental-corporation.com/*', 'munichre.com/*', 'henkel.com/*', 'dpdhl.com/*', 'adidas-group.com/*', 'bmw.com/*', 'bayer.com/*', 'daimler.com/*', 'basf.com/*', 'telekom.com/*', 'volkswagenag.com/*', 'siemens.com/*', 'allianz.com/*', 'lindeplc.com/*', 'sap.com/*' ]
Given the URLs, we can now construct a web data panel for the DAX 30 companies by applying the framework introduced above.
df = pd.DataFrame()
for c in tqdm(urls):
    for y in range(2010, 2022):
        # query up to 10 successfully archived HTML pages per company and year
        requests = list(client.iter(c, from_ts=str(y), to=str(y), limit=10, verbose='v', filter=['status:200', 'mime:text/html']))
        df_temp = pd.concat([pd.DataFrame([r.data for r in requests]), pd.DataFrame([BeautifulSoup(r.content, 'html.parser').get_text(strip=True) for r in requests], columns=['text'])], axis=1)
        if not df_temp.empty:
            df_temp['year'] = df_temp.timestamp.apply(lambda x: str(x)[0:4])
            df_temp['id'] = c
            df_temp = df_temp[['id', 'year', 'text']]
            # append the company-year snapshots to the panel
            df = pd.concat([df, df_temp], ignore_index=True)
The table below shows how many distinct web pages could be retrieved for each of the 30 DAX companies in each year with the proposed framework. Note that we have set the depth limit to 10 distinct web pages per company and year; typically, for large companies such as publicly listed firms, many more web pages are archived (see the size-estimate sketch after the table).
df.rename(columns={'id': 'DAX 30 company', 'year': 'Year', 'text': 'Number of web pages'}).pivot_table(index='DAX 30 company', columns='Year', aggfunc=len, fill_value=0)
Year | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DAX 30 company | Number of web pages | |||||||||||
adidas-group.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
allianz.com/* | 10 | 10 | 10 | 9 | 10 | 10 | 1 | 10 | 10 | 10 | 10 | 10 |
basf.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
bayer.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
beiersdorf.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
bmw.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
continental-corporation.com/* | 0 | 5 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 0 | 0 |
covestro.com/* | 7 | 0 | 0 | 1 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
daimler.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
db.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
deutsche-boerse.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
dpdhl.com/* | 10 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
eon.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
fresenius.com/* | 5 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
freseniusmedicalcare.com/* | 0 | 2 | 0 | 2 | 1 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
group.rwe/* | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 |
heidelbergcement.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
henkel.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
infineon.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
lindeplc.com/* | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 1 | 0 | 0 |
lufthansa.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
merckgroup.com/* | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
munichre.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
sap.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
siemens.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
telekom.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
thyssenkrupp.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
volkswagenag.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
vonovia.de/* | 0 | 0 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
wirecard.com/* | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
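To gauge how much more material lies beyond this depth limit, cdx_toolkit offers a rough estimate of the number of captures for a URL pattern; a minimal sketch of our own against Common Crawl (assuming the get_size_estimate method behaves as documented in the cdx_toolkit README):
# rough estimate of the total number of captures available for a URL pattern
cc_client = cdx_toolkit.CDXFetcher(source='cc')
print(cc_client.get_size_estimate('zew.de/*'))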
Overall, we could build a web panel of the DAX 30 companies spanning the last 12 years. The size of the web panel in memory amounts to:
import sys
# rough in-memory size of the DataFrame in megabytes
print(str(sys.getsizeof(df)/1000000) + ' MB')
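To keep the panel for later analysis, it can be written to disk; a minimal sketch (the file name is our own choice):
# persist the web panel; CSV is chosen here for simplicity
df.to_csv('dax30_web_panel.csv', index=False)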
Conclusion
This post has shown how to create a panel of web data using websites archived by Common Crawl and the Internet Archive. Web content changes over time, and without the preservation efforts of web archives this information would be lost. Reassembling past information from the web in a structured way makes it possible to analyze dynamics and changing themes on the world wide web.