Web Archives, Metadata, and Research
Input talk, SDC-AG Metadaten (metadata working group), 17 February 2022
Julian Dörr, BERD@BW
Archiving web pages
Metadata of an archived web page
Crawl/Capture inDeX (CDX):
Example: A memento of tagesschau.de
... and its metadata
The most important metadata:
urlkey: de,tagesschau)/
timestamp: 20000816064435
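The two fields above can be decoded with a few lines of Python. This is a minimal sketch, not the method used in the talk: `parse_cdx_timestamp` and `simple_urlkey` are hypothetical helper names, the timestamp is treated as UTC by assumption, and the key function only approximates the canonicalization a real archive applies (it drops a leading `www`, ignores the port, and reverses the host labels).

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def parse_cdx_timestamp(ts: str) -> datetime:
    """Parse a 14-digit CDX timestamp (YYYYMMDDhhmmss); UTC is an assumption."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def simple_urlkey(url: str) -> str:
    """Simplified, hypothetical urlkey: reversed host labels, then ')' and the path."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")  # .hostname is lowercased and has no port
    if labels[0] == "www":
        labels = labels[1:]
    key = ",".join(reversed(labels)) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


captured_at = parse_cdx_timestamp("20000816064435")
key = simple_urlkey("http://www.tagesschau.de/")
print(captured_at.isoformat(), key)
```

Because the host labels are reversed, sorting by such a key groups all captures of one domain together, which is what makes prefix queries over a whole site convenient.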
Further metadata (for API queries):
original: http://www.tagesschau.de/
mimetype: text/html
statuscode: 200
digest: 37cf167c2672a4a64af901d9484e75eee0e2c98a
length: 12860
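Taken together, the seven values form one CDX record. The sketch below assumes the record arrives as a single space-separated line in the field order used on these slides (urlkey, timestamp, original, mimetype, statuscode, digest, length). `CdxRecord`, `parse_cdx_line`, and `replay_url` are hypothetical names, and the replay URL pattern assumes the Internet Archive's Wayback Machine as the archive.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-separated CDX line into its seven named fields."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return CdxRecord(*fields)


def replay_url(rec: CdxRecord, base: str = "https://web.archive.org/web") -> str:
    """Build the address of the archived copy (assumes Wayback-style replay URLs)."""
    return f"{base}/{rec.timestamp}/{rec.original}"


# The line is assembled from the values shown on the slide.
line = ("de,tagesschau)/ 20000816064435 http://www.tagesschau.de/ "
        "text/html 200 37cf167c2672a4a64af901d9484e75eee0e2c98a 12860")
rec = parse_cdx_line(line)
print(replay_url(rec))
```

All fields are kept as strings here; converting `statuscode` and `length` to integers would be a natural next step once the data is loaded into a DataFrame.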
Metadata as an object of research?
By focusing on creating a “representative sample” of the web at large, rather than attempting to capture a single site in its entirety [...], the crawl self-limits itself to being applicable [...] to macro-level research examining web scale questions.
-- Sara Crouse, Director, Common Crawl
Example: Communication on corporate websites
An excerpt of the metadata introduced above for the 40 companies in the German stock index (DAX).
dax_cdx.sample(3)
index | firm | year | urlkey | timestamp | original | mimetype | statuscode | digest | length | csr_url |
---|---|---|---|---|---|---|---|---|---|---|
353327 | siemens.com/* | 2010 | com,siemens)/answers/cc/en/index.htm | 20100303165936 | http://www.siemens.com:80/answers/cc/en/index.... | text/html | 200 | OBH4PPPZ4UQWAYVNSPAJWGAQX47FJKFF | 6878 | False |
184564 | telekom.com/* | 2018 | com,telekom)/de/medien/medieninformationen/det... | 20180312225836 | https://www.telekom.com/de/medien/medieninform... | text/html | 200 | 5ZXPYABWKHKW5Q4NNPIFEWIEFAFQPZ4I | 13650 | False |
210928 | fresenius.de/* | 2005 | de,fresenius)/e/2/e_02_4.html | 20050211034333 | http://www.fresenius.de:80/e/2/e_02_4.html | text/html | 200 | UXPKXNPLX5SGDLNLVQR462TUQLPJ7HAT | 3588 | False |
Introducing an (incomplete) keyword list on the topic of "sustainability" ...
csr_list = ['sustain', 'nachhaltig']
... followed by a keyword search in the urlkeys
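One way such a keyword search might look in pandas is sketched below. The slides do not show the actual implementation, so this is an assumption: the toy `dax_cdx` frame and its urlkeys are invented for illustration, and only `csr_list` and the column names `urlkey` and `csr_url` come from the slides.

```python
import re

import pandas as pd

csr_list = ['sustain', 'nachhaltig']

# Toy stand-in for dax_cdx; these urlkeys are made up for illustration.
dax_cdx = pd.DataFrame({
    "urlkey": [
        "com,example)/sustainability/report",
        "de,example)/nachhaltigkeit",
        "com,example)/investors/index.htm",
    ]
})

# Join the keywords into one regular expression, escaping each keyword.
pattern = "|".join(re.escape(keyword) for keyword in csr_list)

# Flag every capture whose urlkey contains at least one keyword.
dax_cdx["csr_url"] = dax_cdx["urlkey"].str.contains(pattern, case=False, regex=True)
print(dax_cdx)
```

Matching on word stems such as `sustain` catches variants like "sustainability" and "sustainable", at the cost of possible false positives in unrelated paths.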
Challenges of web archives
Summary
Thank you for your attention!
Appendix
'collapse': 'urlkey',
'filter': ['statuscode:200', 'mimetype:text/html']
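The two parameters above read like part of a CDX API request. The following sketch shows how a full query URL could be assembled, assuming the Internet Archive's Wayback CDX server as the endpoint. The function name `build_cdx_query`, the per-year `from`/`to` window, `output=json`, and the `limit` of 1000 are assumptions and are not shown in the source; only `collapse` and `filter` are taken from the slide.

```python
from urllib.parse import urlencode

# Assumption: the Internet Archive's Wayback CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_pattern: str, year: int, limit: int = 1000) -> str:
    """Build a CDX query URL for one URL pattern and one calendar year."""
    params = {
        "url": url_pattern,       # e.g. "siemens.com/*" for a prefix query
        "from": str(year),        # assumed: restrict captures to a single year
        "to": str(year),
        "output": "json",         # assumed output format
        "limit": limit,           # assumed cap on the number of returned captures
        "collapse": "urlkey",     # from the slide: one capture per distinct urlkey
        "filter": ["statuscode:200", "mimetype:text/html"],  # from the slide
    }
    # doseq=True emits one "filter=" pair per list element.
    return f"{CDX_ENDPOINT}?{urlencode(params, doseq=True)}"


query = build_cdx_query("siemens.com/*", 2010)
print(query)
```

The function only builds the URL and performs no network request, so it can be inspected or logged before any data is fetched.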
pd.pivot_table(dax_cdx[['firm', 'year']], index=['firm'], columns=['year'], aggfunc=lambda x: str(len(x)), fill_value='')
firm / year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
adidas-group.com/* | 193 | 1000 | 118 | 964 | ... | 1000 | 831 | 1000 | 1000 | 797 | 235 | 173 | 947 | 1000 | |||||||
airbus.com/* | 406 | 471 | 838 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | |
allianz.com/* | 73 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 8 | 8 | 10 | 1 | 189 | 117 | 1000 | 1000 | |
basf.com/* | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1 | ... | 190 | 109 | 279 | 573 | 89 | 1000 | 908 | 1000 | 1000 | ||
bayer.de/* | 226 | 1000 | 1000 | 1000 | 1000 | 1000 | 418 | 580 | 283 | ... | 376 | 363 | 306 | 388 | 361 | 282 | 239 | 283 | 312 | ||
beiersdorf.de/* | 970 | 258 | 439 | 642 | 640 | 714 | 355 | 784 | 758 | 913 | ... | 729 | 671 | 495 | 536 | 813 | 36 | 227 | 1000 | 754 | |
bmwgroup.com/* | 60 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 412 | 232 | 251 | 893 | 1000 | |
brenntag.com/* | 2 | 412 | 1000 | 1000 | 1000 | 914 | 322 | 390 | 184 | ... | 265 | 300 | 234 | 546 | 1000 | 1000 | 258 | 1000 | |||
continental.com/* | 1000 | 1000 | 1000 | 1000 | 465 | 286 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 375 | 1000 | 1000 | ||||||
corporate.zalando.com/* | ... | 5 | 107 | 247 | 275 | 162 | 281 | 1000 | 1000 | ||||||||||||
covestro.com/* | ... | 1 | 328 | 593 | 568 | 150 | 226 | 1000 | 1000 | ||||||||||||
deliveryhero.com/* | ... | 41 | 26 | 68 | 50 | 65 | 111 | 342 | 424 | 365 | |||||||||||
deutsche-bank.de/* | 24 | 48 | 631 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 722 | 1000 | 1000 | 468 | 405 | 412 | 311 | 1000 | |
deutsche-boerse.com/* | 1000 | 1000 | 1000 | 1000 | 496 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ||
dpdhl.com/* | 386 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 424 | 960 | 1000 | ||||||||||
eon.com/* | 189 | 1000 | 1000 | 80 | 99 | 167 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 264 | 938 | 1000 | 1000 |
fresenius.de/* | 490 | 1000 | 1000 | 1000 | 1000 | 1000 | 238 | 463 | 617 | 703 | ... | 1000 | 456 | 456 | 605 | 101 | 142 | 28 | 152 | 389 | 344 |
freseniusmedicalcare.com/* | 362 | 165 | 276 | 11 | 13 | 7 | 15 | 91 | ... | 1 | 72 | 445 | 197 | 328 | 881 | 847 | |||||
group.rwe/* | ... | 58 | 466 | 1000 | 1000 | ||||||||||||||||
heidelbergcement.com/* | 52 | 755 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 480 | 374 | 421 | 486 | |||
hellofresh.de/* | ... | 140 | 288 | 292 | 71 | 63 | 52 | 173 | 125 | 1000 | 1000 | ||||||||||
henkel.de/* | 289 | 321 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 246 | 444 | 1000 | 1000 |
infineon.com/* | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 976 | 1000 | 1000 | |
lindeplc.com/* | ... | 1 | |||||||||||||||||||
merckgroup.com/* | 1 | 1 | 40 | ... | 529 | 116 | 196 | 206 | 209 | 208 | 13 | 23 | 255 | 1000 | |||||||
mtu.de/* | 320 | 4 | 9 | 1000 | 1000 | 992 | 1000 | 631 | 969 | ... | 981 | 941 | 1000 | 818 | 1000 | 1000 | 888 | 803 | 1000 | 766 | |
munichre.com/* | 198 | 555 | 911 | 53 | 37 | 530 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 967 | 1000 | 1000 | 1000 | |||
porsche-se.com/* | 27 | 213 | 236 | ... | 218 | 216 | 162 | 368 | 84 | 18 | 70 | 873 | 106 | ||||||||
puma.de/* | 15 | 207 | 775 | 537 | 452 | 36 | 9 | 6 | 1 | ... | 1 | 1 | |||||||||
qiagen.com/* | 39 | 123 | 156 | 352 | 193 | 367 | 403 | 239 | 15 | 11 | ... | 77 | 251 | 454 | 57 | 53 | 1000 | 204 | 159 | 1000 | 1000 |
sap.de/* | 6 | 6 | 1 | 6 | ... | ||||||||||||||||
sartorius.com/* | 252 | 1000 | 1000 | 1000 | 1000 | 1000 | 293 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 35 | 25 | 647 | 1000 | 1000 | |||
siemens-energy.com/* | ... | 1 | 2 | 1000 | 1000 | ||||||||||||||||
siemens-healthineers.com/* | ... | 836 | 1000 | 1000 | |||||||||||||||||
siemens.com/* | 1000 | 1000 | 1000 | 1000 | 196 | 1000 | 1000 | 1000 | 1000 | 2 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 216 | 815 | 1000 |
symrise.com/* | 302 | 618 | 607 | 556 | 202 | 70 | ... | 203 | 184 | 393 | 1000 | 975 | 492 | 29 | 664 | 846 | |||||
telekom.com/* | 15 | 12 | 7 | 10 | 7 | 2 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 985 | 1000 | 1000 | 1000 | 1000 | |
volkswagenag.com/* | 1 | 75 | 126 | 913 | 1000 | 372 | ... | 475 | 736 | 1000 | 1000 | 837 | 680 | 1000 | 1000 | 1000 | |||||
vonovia.de/* | ... | 54 | 312 | 223 | 259 | 659 | 1000 |
39 rows × 22 columns
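A more explicit way to obtain the same kind of per-firm, per-year capture counts is `pd.crosstab`. The sketch below is an alternative to the pivot call above, not the author's code, and the three-row `dax_cdx` frame is toy data for illustration only.

```python
import pandas as pd

# Toy stand-in for dax_cdx with only the two columns needed for counting.
dax_cdx = pd.DataFrame({
    "firm": ["siemens.com/*", "siemens.com/*", "telekom.com/*"],
    "year": [2010, 2010, 2018],
})

# Rows are firms, columns are years, cells are the number of captures.
counts = pd.crosstab(dax_cdx["firm"], dax_cdx["year"])
print(counts)
```

Unlike the table above, firm-year combinations without captures appear as 0 rather than as empty cells.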