🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Collect and revisit web pages.
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Serverless Web Archive Replay directly in the browser
Webrecorder Player for desktop (macOS/Windows/Linux), built with Electron + Webrecorder
Streaming WARC/ARC library for fast web archive I/O
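To give a feel for what such a library streams over, here is a minimal sketch of the WARC record layout in plain Python — a header block of `Name: value` lines, a blank line, a payload of `Content-Length` bytes, and a two-CRLF separator. This is an illustrative subset of the format (a conformant record also carries fields such as WARC-Record-ID and WARC-Date), not any particular library's API:

```python
import io

def write_warc_record(out, record_type, target_uri, payload):
    """Write one uncompressed WARC record (illustrative subset of the format)."""
    headers = (
        f"WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        f"\r\n"
    ).encode("utf-8")
    out.write(headers)
    out.write(payload)
    out.write(b"\r\n\r\n")  # record separator: two CRLFs after the payload

def read_warc_records(stream):
    """Stream records back out, yielding (header_dict, payload_bytes)."""
    while True:
        version = stream.readline()
        if not version:
            return
        if not version.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        for line in iter(stream.readline, b"\r\n"):
            name, _, value = line.decode("utf-8").rstrip("\r\n").partition(": ")
            headers[name] = value
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# Round-trip through an in-memory buffer
buf = io.BytesIO()
write_warc_record(buf, "resource", "http://example.com/", b"hello")
buf.seek(0)
records = list(read_warc_records(buf))
```

Reading record by record like this, without loading the whole file, is what makes streaming WARC I/O fast on multi-gigabyte archives.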
Bitextor generates translation memories from multilingual websites
News crawling with StormCrawler; stores content as WARC
Chrome extension to "Create WARC files from any webpage"
CoCrawler is a versatile web crawler built using modern tools and concurrency.
An Apache Spark framework for data processing, extraction, and derivation from web archives and archival collections, developed at the Internet Archive
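A framework like this typically maps a per-record extraction function over a collection of archived pages. The per-record step — here, pulling outgoing links from archived HTML — might look like the following stdlib-only sketch (the framework's actual API is not shown; this only illustrates the kind of derivation involved):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in archived HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return all <a href> targets found in one archived page."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

Applied across millions of records in parallel, a function like `extract_links` yields derivatives such as hyperlink graphs of an archived web collection.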
A toolkit for CDX indices such as those of Common Crawl and the Internet Archive's Wayback Machine
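CDX indices map URLs to capture locations inside WARC files. One common variant, CDXJ, puts a SURT-sorted key and a 14-digit timestamp before a JSON metadata block; a minimal parser for that line shape (a sketch of the format, not of any specific toolkit's API) could be:

```python
import json

def parse_cdxj_line(line):
    """Split a CDXJ index line into (SURT key, 14-digit timestamp, metadata dict).

    CDXJ lines look like:
      com,example)/ 20230614120000 {"url": "http://example.com/", "status": "200"}
    """
    key, timestamp, json_block = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(json_block)
```

Because the keys are sorted, a lookup tool can binary-search a CDXJ file for all captures of a URL without scanning the underlying archives.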
Parse and create Web ARChive (WARC) files with Node.js