The architecture shows the flow of data from the web all the
way down to the URL Link Rows and can be described as
follows:
1. Web - The web contains many websites, each reachable at a specific URL.
2. JSoup - This open-source HTML parser parses websites
in order to extract URLs.
3. URLs - The URLs that JSoup extracts from the parsed websites.
4. Tower Web Crawler - The crawler was originally designed as a stand-alone,
multi-threaded crawler that used JSoup to extract URLs
and saved the crawled website URLs in a text file on a
computer. With a few tweaks, this stand-alone crawler
was converted into a web crawler embedded in a
Spring Boot application running in the cloud.
5. Link Repository - The Link Repository extends
JpaRepository<Link, Long>. It is supplied to the integrated
crawler through constructor dependency injection, and the
crawler uses it to save URLs to the database. Link is the
Hibernate entity mapped to the Link Table in the database.
6. MySQL/H2 Database - The MySQL and H2 databases correspond to
the two versions of the web application. Both contain the Link Table.
7. Link Table - Each time the Spring application starts up,
the Link Table is created from schema.sql. The
Link Table contains rows of crawl data.
8. URL Link Rows - The URL Link Rows contain the ids and
URLs crawled by the embedded web crawler.
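
The flow described in the list above can be sketched in plain Java. This is a minimal, dependency-free illustration under stated assumptions, not the project's actual code: a regular expression stands in for JSoup's link extraction, and an in-memory list stands in for the Link Table and the JpaRepository<Link, Long> that writes to it. The names LinkStore, InMemoryLinkStore, EmbeddedCrawler and CrawlFlowSketch are hypothetical. The one detail taken directly from the description is that the crawler receives its repository through its constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the Link entity: one row holds an id and a URL.
final class Link {
    final long id;
    final String url;

    Link(long id, String url) {
        this.id = id;
        this.url = url;
    }
}

// Hypothetical stand-in for the Link Repository (a JpaRepository in the real application).
interface LinkStore {
    Link save(String url);

    List<Link> findAll();
}

// Assumption: an in-memory list replaces the MySQL/H2 Link Table for this sketch.
final class InMemoryLinkStore implements LinkStore {
    private final List<Link> rows = new ArrayList<>();
    private long nextId = 1;

    @Override
    public synchronized Link save(String url) {
        Link row = new Link(nextId++, url); // ids increase by one per saved row
        rows.add(row);
        return row;
    }

    @Override
    public synchronized List<Link> findAll() {
        return new ArrayList<>(rows); // defensive copy of the stored rows
    }
}

final class EmbeddedCrawler {
    // Assumption: a simple regex replaces JSoup here; it only matches
    // absolute http(s) URLs inside double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    private final LinkStore store;

    // Constructor dependency injection: the store is handed in, not created here.
    EmbeddedCrawler(LinkStore store) {
        this.store = store;
    }

    // Step "JSoup -> URLs": pull the URLs out of one page of HTML.
    List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    // Step "Crawler -> Link Repository -> Link Table": save every extracted URL as a row.
    int crawlPage(String html) {
        List<String> urls = extractUrls(html);
        for (String url : urls) {
            store.save(url);
        }
        return urls.size();
    }
}

final class CrawlFlowSketch {
    public static void main(String[] args) {
        LinkStore store = new InMemoryLinkStore();
        EmbeddedCrawler crawler = new EmbeddedCrawler(store);
        crawler.crawlPage("<a href=\"https://example.com/a\">A</a>");
        for (Link row : store.findAll()) {
            System.out.println(row.id + " " + row.url);
        }
    }
}
```

Passing the store through the constructor keeps the crawler independent of the storage behind it, which is what allows the same crawler to run against either database version in the description above.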
References
Jonathan Hedley & jsoup contributors. jsoup: Java HTML
Parser (2009–present). Available at: https://jsoup.org
Spring Boot. Available at: https://spring.io/projects/spring-boot
Hibernate. Available at: https://hibernate.org
MySQL Database. Available at: https://www.mysql.com
H2 Database. Available at: https://h2database.com/html/main.html
Copyright © 2025 Tower Web Crawler - All Rights Reserved.