The architecture shows the flow of data from the web all the
way down to the URL Link Rows and can be described as
follows:
1. Web - The web contains many websites at specific URLs.
2. JSoup - This open-source HTML parser parses web pages in
order to extract the URLs they link to.
3. URLs - The URLs that JSoup extracts from the parsed pages.
4. Tower Web Crawler - The Tower Web Crawler was originally
designed as a stand-alone, multi-threaded crawler that used
JSoup to extract URLs and saved the crawled website URLs to a
text file on a computer. With a few tweaks, this stand-alone
crawler was converted into an embedded web crawler integrated
into a Spring Boot application.
5. Link Repository - The Link Repository extends
JpaRepository<Link, Long>, where Link is the Hibernate entity
mapped to the Link Table in the H2 database. The repository is
supplied to the integrated crawler through constructor
dependency injection, and the crawler uses it to save URLs to
the H2 database.
6. H2 Database - The H2 database runs in memory and contains
the Link Table.
7. Link Table - The Link Table is created from schema.sql each
time the Spring application starts up. It simply contains rows
of crawl data.
8. URL Link Rows - Each URL Link Row contains an id and a URL
crawled by the embedded web crawler.
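
The flow described in the list can be sketched in plain Java. This is a minimal illustration that uses only the standard library, not the application's actual code: a simplified regular expression stands in for JSoup, and a hypothetical LinkStore interface with an in-memory implementation stands in for the Link Repository and the H2 Link Table. The names LinkStore, InMemoryLinkStore, CrawlerSketch and PipelineDemo, as well as the shape of the Link class, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the Link entity: one row holding an id and a URL.
final class Link {
    private final long id;
    private final String url;

    Link(long id, String url) {
        this.id = id;
        this.url = url;
    }

    long getId() { return id; }
    String getUrl() { return url; }
}

// Hypothetical, simplified stand-in for the repository interface.
interface LinkStore {
    Link save(String url);
    List<Link> findAll();
}

// In-memory stand-in for the Link Table; ids are assigned on save.
final class InMemoryLinkStore implements LinkStore {
    private final List<Link> rows = new ArrayList<>();
    private long nextId = 1;

    @Override
    public synchronized Link save(String url) {
        Link row = new Link(nextId++, url);
        rows.add(row);
        return row;
    }

    @Override
    public synchronized List<Link> findAll() {
        return new ArrayList<>(rows); // defensive copy
    }
}

// Sketch of the embedded crawler: the store arrives through the constructor.
final class CrawlerSketch {
    // Simplified matcher for absolute http(s) links in double-quoted href
    // attributes; a real HTML parser handles far more cases.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    private final LinkStore store;

    CrawlerSketch(LinkStore store) {
        this.store = store;
    }

    // Extracts absolute links in document order, without duplicates.
    static Set<String> extractUrls(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    // Saves each extracted URL as one row; returns the number of rows written.
    int crawl(String html) {
        Set<String> urls = extractUrls(html);
        for (String url : urls) {
            store.save(url);
        }
        return urls.size();
    }
}

final class PipelineDemo {
    public static void main(String[] args) {
        LinkStore store = new InMemoryLinkStore();
        CrawlerSketch crawler = new CrawlerSketch(store);
        crawler.crawl("<a href=\"https://example.com/a\">A</a>"
                + "<a href=\"https://example.com/b\">B</a>");
        for (Link row : store.findAll()) {
            System.out.println(row.getId() + " " + row.getUrl());
        }
    }
}
```

In the Spring Boot application itself, the store passed to the crawler's constructor would be the Link Repository supplied by Spring, and the saved rows would live in the H2 Link Table rather than in a Java list.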