Brief description
A website analyzer is a web, desktop, or mobile app that samples a few pages of a website and extracts summary information from their content, such as key terms and page statistics.
Input
You are given a website address (its URL).
Output
A linked tag cloud of the significant terms in the website content.
A dashboard showing the results of analyzing the website content.
Features – Essential
- Given a URL, find the home page of the website (sometimes you may be given a URL of an inner page)
- Find the links in the home page (and remove duplicates)
- Store the links
- For each link, get the web page, extract text and store it
- Parse the text (remove stop words and punctuation) and generate a list of unigrams and bigrams from it. We will refer to these as key terms.
- Create a tag cloud of the top 20 key terms.
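The essential steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed design: it assumes the home page is the site root (scheme plus host), uses a deliberately tiny stop-word list, and forms bigrams after stop words are removed. The names `home_page`, `extract_links`, `extract_text`, and `key_terms` are hypothetical helpers introduced here, and fetching pages over the network is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Tiny illustrative stop-word list; a real analyzer would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def home_page(url):
    """Reduce any URL on a site to its root (assumed to be the home page)."""
    if "://" not in url:
        url = "https://" + url  # assumption: default to HTTPS when no scheme is given
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


class LinkCollector(HTMLParser):
    """Collect unique, absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop "#fragment" so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def extract_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)


def key_terms(text, top_n=20):
    """Return the top_n (term, count) pairs over unigrams and bigrams."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    # Bigrams are adjacent pairs in the filtered word list.
    counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts.most_common(top_n)
```

To use it, fetch the HTML of `home_page(url)`, pass it to `extract_links`, then run `extract_text` and `key_terms` on each linked page. Storing the links and page text (in files or a database) is left open here.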
Features – Desirable
- Customize the tag cloud with frequency-based font sizes: the more frequent the term, the larger the font.
- Make each key term a hyperlink.
- Generate a JSON file with the information given below:
1. Number of pages at the top level
2. A list of hyperlinked page titles, each with a list of 10 tags (from the page text)
- Display the statistics gathered in step 9 (the JSON file described above)
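One possible way to approach the desirable features is sketched below: a linear mapping from term frequency to font size, a tag cloud rendered as HTML links, and a JSON report. Several details are assumptions rather than requirements: the pixel range (12 to 36), the link target (a hypothetical `/search?q=` prefix), the JSON field names, and the reading of "number of pages at the top level" as the number of pages linked from the home page.

```python
import json
from html import escape
from urllib.parse import quote_plus


def font_size(count, lowest, highest, min_px=12, max_px=36):
    """Map a term count linearly onto a font size in pixels."""
    if highest == lowest:
        return (min_px + max_px) // 2  # all terms equally frequent
    scale = (count - lowest) / (highest - lowest)
    return round(min_px + scale * (max_px - min_px))


def tag_cloud_html(term_counts, link_prefix="/search?q="):
    """Render (term, count) pairs as links sized by frequency.

    link_prefix is a hypothetical target; point it wherever a term should lead.
    """
    if not term_counts:
        return ""
    counts = [count for _term, count in term_counts]
    lowest, highest = min(counts), max(counts)
    links = []
    for term, count in term_counts:
        size = font_size(count, lowest, highest)
        href = escape(link_prefix + quote_plus(term), quote=True)
        links.append(f'<a href="{href}" style="font-size:{size}px">{escape(term)}</a>')
    return "\n".join(links)


def build_report(pages):
    """Build the JSON-ready report from a list of page dicts.

    Each page dict is assumed to have "title", "url", and "tags" keys.
    """
    return {
        "top_level_page_count": len(pages),
        "pages": [
            {"title": page["title"], "url": page["url"], "tags": page["tags"][:10]}
            for page in pages
        ],
    }


def report_json(pages):
    return json.dumps(build_report(pages), indent=2)
```

The dashboard can then read the JSON produced by `report_json` and display the same statistics, which keeps the analysis and the presentation separate.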
Features – Nice to Have
- Include additional information in the JSON file:
1. A list of contact addresses extracted from the site (if available)
2. A list of job positions on the site (if available)
3. A list of products or services the company offers (if available)
4. Social media links (if available)
- Display the statistics gathered in steps 9 and 11 (the JSON file and its additional information)
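Two of the nice-to-have items lend themselves to simple heuristics, sketched below under stated assumptions: "contact addresses" is read as email addresses, found with a loose regular expression, and social media links are recognized by matching link hosts against a short, hand-picked list of domains. Both `SOCIAL_HOSTS` and the helper names are illustrative. Job positions and products or services usually need site-specific rules and are not covered here.

```python
import re
from urllib.parse import urlsplit

# Illustrative, incomplete list of social media hosts (an assumption).
SOCIAL_HOSTS = {
    "facebook.com",
    "twitter.com",
    "x.com",
    "linkedin.com",
    "instagram.com",
    "youtube.com",
}

# Loose email pattern; good enough for a sketch, not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def social_links(links):
    """Return the links whose host is a known social media site, in order."""
    found = []
    for link in links:
        host = (urlsplit(link).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_HOSTS and link not in found:
            found.append(link)
    return found


def contact_emails(text):
    """Return the unique email addresses found in the page text, sorted."""
    return sorted(set(EMAIL_RE.findall(text)))
```

The results can be added to the JSON file under keys of your choosing, and each key can be omitted when nothing is found, matching the "if available" wording above.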