Tower Web Crawler


About

Tower Embedded Web Crawler Architecture

The architecture shows the flow of data from the web all the way down to the URL Link Rows and can be described as follows:


1. Web - The web contains many websites at specific URLs.

2. JSoup - This open-source HTML parser parses web pages in order to extract the URLs they link to.

3. URLs - The URLs that JSoup extracts from the parsed pages.

4. Tower Web Crawler - Originally designed as a stand-alone, multi-threaded crawler that crawled the web using JSoup to extract URLs and saved the crawled website URLs in a text file on a computer. With a few tweaks, this stand-alone crawler was converted into an embedded web crawler integrated into a Spring Boot application.

5. Link Repository - The Link Repository extends JpaRepository<Link, Long>, where Link is the Hibernate entity mapped to the Link Table in the H2 database. The repository is supplied to the integrated crawler through constructor dependency injection, and the crawler uses it to save URLs to the H2 database.

6. H2 Database - This in-memory H2 database contains the Link Table.

7. Link Table - The Link Table is created from schema.sql each time the Spring application starts up. It simply contains rows of crawl data.

8. URL Link Rows - Each URL Link Row contains an id and a URL crawled by the embedded web crawler.
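
The flow described in steps 1 through 8 can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the project's actual code: it uses only the JDK, so a link-extractor function stands in for JSoup, and an in-memory store stands in for the Link Repository and the H2 Link Table. The names LinkRow, LinkStore, InMemoryLinkStore, EmbeddedCrawler and CrawlSketch, the example URLs, and the level-by-level crawl strategy are all hypothetical choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical stand-in for the Link entity: one row holding an id and a URL.
final class LinkRow {
    final long id;
    final String url;
    LinkRow(long id, String url) { this.id = id; this.url = url; }
}

// Hypothetical stand-in for the Link Repository (the real one extends JpaRepository<Link, Long>).
interface LinkStore {
    LinkRow save(String url);
    List<LinkRow> findAll();
}

// In-memory list playing the role of the Link Table in the H2 database.
final class InMemoryLinkStore implements LinkStore {
    private final List<LinkRow> rows = new ArrayList<>();
    public synchronized LinkRow save(String url) {
        LinkRow row = new LinkRow(rows.size() + 1L, url); // ids start at 1
        rows.add(row);
        return row;
    }
    public synchronized List<LinkRow> findAll() {
        return new ArrayList<>(rows); // defensive copy
    }
}

// Crawler that receives its collaborators through the constructor.
final class EmbeddedCrawler {
    private final Function<String, List<String>> linkExtractor; // JSoup's role
    private final LinkStore store;
    EmbeddedCrawler(Function<String, List<String>> linkExtractor, LinkStore store) {
        this.linkExtractor = linkExtractor;
        this.store = store;
    }
    // Crawls level by level from seedUrl, visiting each URL at most once.
    void crawl(String seedUrl, int threads, int maxDepth) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> seen = new HashSet<>(List.of(seedUrl));
        List<String> frontier = List.of(seedUrl);
        try {
            for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> {
                        store.save(url);                 // persist the crawled URL
                        return linkExtractor.apply(url); // extract its outgoing links
                    });
                }
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (seen.add(link)) next.add(link); // keep unseen URLs only
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("crawl failed", e);
        } finally {
            pool.shutdown();
        }
    }
}

final class CrawlSketch {
    // Runs the crawler against a tiny made-up "web" and returns the stored rows.
    static List<LinkRow> run() {
        Map<String, List<String>> fakeWeb = Map.of(
            "https://a.example", List.of("https://b.example", "https://c.example"),
            "https://b.example", List.of("https://a.example", "https://c.example"),
            "https://c.example", List.<String>of());
        LinkStore store = new InMemoryLinkStore();
        EmbeddedCrawler crawler =
            new EmbeddedCrawler(url -> fakeWeb.getOrDefault(url, List.of()), store);
        crawler.crawl("https://a.example", 4, 2);
        return store.findAll();
    }
    public static void main(String[] args) {
        run().forEach(row -> System.out.println(row.id + " " + row.url));
    }
}
```

Running CrawlSketch stores one row per distinct URL reached from the seed. Because the crawler only sees the LinkStore interface passed to its constructor, the in-memory store could be swapped for a database-backed repository without changing the crawl logic, which is the benefit constructor injection is meant to provide.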


References

Jonathan Hedley & jsoup contributors. jsoup: Java HTML Parser (2009–present). Available at: https://jsoup.org

Spring Boot. Available at: https://spring.io/projects/spring-boot

Hibernate. Available at: https://hibernate.org/

H2 Database. Available at: https://h2database.com/html/main.html

Contact Me

developer@towerwebcrawler.com

Copyright © 2025 Tower Web Crawler - All Rights Reserved.

