About

The architecture shows the flow of data from the web all the way
down to the URL Link Rows and can be described as follows:

1. Web: The web contains many websites, each at a specific URL.

2. JSoup: This open-source HTML parser parses web pages in order
to extract URLs.

3. URLs: The URLs extracted by JSoup.

4. Tower Web Crawler: This was originally designed as a stand-alone,
multi-threaded crawler that crawls the web, uses JSoup to extract
URLs, and saves the crawled website URLs to a text file on a
computer. With a few tweaks, the stand-alone crawler was converted
into an embedded web crawler integrated into a Spring Boot
application.

5. Link Repository: The Link Repository extends
JpaRepository<Link, Long>, where Link is the Hibernate entity
mapped to the Link Table in the H2 database. The repository is
passed to the integrated crawler through constructor dependency
injection, and the crawler uses it to save URLs to the H2 database.

6. H2 Database: The H2 database runs in memory and contains the
Link Table.

7. Link Table: The Link Table is created from schema.sql each time
the Spring application starts up. It contains the rows of crawl
data.

8. URL Link Rows: Each row contains an id and a URL crawled by the
embedded web crawler.
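
The flow described in the list above can be sketched in plain Java.
This is a dependency-free illustration under stated assumptions, not
the project's code: a simplified regular expression stands in for
JSoup, a hypothetical LinkStore interface stands in for the Link
Repository (which in the application extends JpaRepository<Link, Long>),
and a list stands in for the in-memory H2 Link Table. All class names,
the sample HTML, and the choice to skip duplicate URLs within a page
are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the Hibernate entity mapped to the Link Table: one row = id + url.
final class Link {
    private Long id;            // assigned on save, like a generated primary key
    private final String url;

    Link(String url) { this.url = url; }

    Long getId() { return id; }
    void setId(Long id) { this.id = id; }
    String getUrl() { return url; }
}

// Hypothetical stand-in for the Link Repository (really a JpaRepository<Link, Long>).
interface LinkStore {
    Link save(Link link);
    List<Link> findAll();
}

// Stand-in for the in-memory H2 Link Table.
final class InMemoryLinkStore implements LinkStore {
    private final List<Link> rows = new ArrayList<>();
    private long nextId = 1;

    @Override
    public Link save(Link link) {
        link.setId(nextId++);
        rows.add(link);
        return link;
    }

    @Override
    public List<Link> findAll() {
        return new ArrayList<>(rows);   // defensive copy of the stored rows
    }
}

// Stand-in for the embedded crawler; the store arrives via constructor injection.
final class EmbeddedCrawler {
    // Simplified href matcher used instead of JSoup; it is not a real HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    private final LinkStore store;

    EmbeddedCrawler(LinkStore store) { this.store = store; }

    // Extracts distinct absolute URLs from one page's HTML, in document order.
    static Set<String> extractUrls(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Saves each extracted URL as a row and returns how many rows were written.
    int crawlPage(String html) {
        int saved = 0;
        for (String url : extractUrls(html)) {
            store.save(new Link(url));
            saved++;
        }
        return saved;
    }
}

class CrawlFlowSketch {
    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/a\">A</a>"
                    + "<a href=\"https://example.com/b\">B</a>"
                    + "<a href=\"https://example.com/a\">A again</a>";
        LinkStore store = new InMemoryLinkStore();
        new EmbeddedCrawler(store).crawlPage(html);
        for (Link row : store.findAll()) {
            System.out.println(row.getId() + " " + row.getUrl());
        }
        // prints:
        // 1 https://example.com/a
        // 2 https://example.com/b
    }
}
```

In the Spring Boot application, Spring would supply the repository to
the crawler's constructor; the manual wiring in main only imitates that
step so the sketch can run on its own.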

References

Jonathan Hedley & jsoup contributors. jsoup: Java HTML Parser (2009–present). Available at: https://jsoup.org

Spring Boot. Available at: https://spring.io/projects/spring-boot

Hibernate. Available at: https://hibernate.org/

H2 Database. Available at: https://h2database.com/html/main.html