Commit f3f74fa

added README
1 parent fc23c1b commit f3f74fa

1 file changed

Lines changed: 46 additions & 0 deletions

README

@@ -0,0 +1,46 @@
+A programming exercise. A web crawler written in Python.
+
+TODO
+don't process a URL that has already been processed.
+the complex part.
+figure out how to do tests first.
+error handling.
+
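The first TODO item could be handled along the following lines. This is only a sketch, not the code in this repository: `normalize` and `SeenUrls` are hypothetical names, and the choice to ignore fragments and host-name case when comparing URLs is an assumption.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Return a canonical spelling of url so trivial variants compare equal."""
    url, _fragment = urldefrag(url)      # drop any "#section" anchor
    parts = urlsplit(url)
    path = parts.path or "/"             # "http://a.com" and "http://a.com/" are the same page
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class SeenUrls:
    """Remembers which URLs have already been processed (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def first_visit(self, url):
        """Return True the first time a URL is offered, False on every repeat."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A crawler would call `first_visit(link)` before queueing each link and skip the link when it returns `False`.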
+Thoughts
+It took 2 hours.
+I have never seen Python before.
+Python seems to make sense.
+finding the correct regex method was tricky.
+casting numbers to strings was annoying.
+
+Problem statement:
+Language: Python
+Official site: http://python.org/
+Beginner's guide: http://wiki.python.org/moin/BeginnersGuide
+Tutorial: http://docs.python.org/tutorial/
+
+App Category: Networking
+DB: None
+Simple part:
+Create a web crawler app in Python which, given a URL seed, can crawl
+through all links on the page and scan deep for a given level of
+depth. While crawling, the app should be able to return the URL of any page
+containing a specific search text.
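The simple part described above could be sketched as a breadth-first crawl using only the standard library. Everything here is an assumption rather than the author's implementation: the names `LinkParser`, `fetch`, and `crawl` are hypothetical, the seed is taken to be level 0, the match is case-insensitive, and the seed must include a scheme such as `http://`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):   # URLError is an OSError; bad URLs raise ValueError
        return None


def crawl(seed, depth, text, fetch_page=fetch):
    """Breadth-first crawl from seed; return URLs whose page contains text."""
    matches = []
    seen = {seed}
    queue = deque([(seed, 0)])      # (url, level); the seed is level 0
    while queue:
        url, level = queue.popleft()
        page = fetch_page(url)
        if page is None:
            continue                # skip pages that could not be fetched
        if text.lower() in page.lower():
            matches.append(url)
        if level >= depth:
            continue                # do not follow links beyond the requested depth
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links against the current page
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return matches
```

Passing the page fetcher as a parameter keeps the crawl logic testable without network access: a test can hand in a dictionary's `get` method in place of `fetch`.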
+Input:
+1 - URL seed e.g. www.hackernews.com
+2 - Depth e.g. 5 (this means follow links on a page up to 5 levels deep)
+3 - Search text e.g. "python"
+Output:
+The list of URLs whose pages contain the specified text
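One way to read the three inputs from the command line is shown below, as a sketch only. The function name `parse_inputs` and the positional-argument layout are assumptions. The example seed `www.hackernews.com` has no scheme, so the sketch assumes `http://` should be prepended in that case.

```python
import argparse


def parse_inputs(argv):
    """Parse seed, depth and search text from a list of command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Crawl from a seed URL and list pages containing a text."
    )
    parser.add_argument("seed", help="seed URL, e.g. www.hackernews.com")
    parser.add_argument("depth", type=int, help="how many levels of links to follow")
    parser.add_argument("text", help='search text, e.g. "python"')
    args = parser.parse_args(argv)
    if args.depth < 0:
        parser.error("depth must be zero or greater")
    # Assumption: a seed given without a scheme is meant to be fetched over http.
    if "://" not in args.seed:
        args.seed = "http://" + args.seed
    return args
```

In a script this would typically be called as `parse_inputs(sys.argv[1:])`. Declaring `depth` with `type=int` converts the string argument to a number once, at the boundary.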
+The simple part is mandatory.
+If you finish the simple part and are eager to take up something
+challenging, then here is a more complex angle to the problem:
+Complex part:
+Write rules around the app for searching.
+Rule 1: The returned URL should contain a specific substring
+Rule 2: Highlight in the output if the URL is among a long list of
+blacklisted URLs (about 10000 blacklisted URLs)
+Rule 3: Search for multiple search strings and rank URLs by the
+number of different search strings found and occurrences of each search
+string in the page
+Rule 4: Rank by level of the URL w.r.t. the seed URL
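The four rules above could be combined roughly as follows. This is a sketch under stated assumptions, not part of the commit: `rank_pages` and its parameters are hypothetical, pages arrive as `(url, level, page_text)` tuples from a crawl, and the tie-break order (distinct terms, then total occurrences, then shallower level) is one possible reading of Rules 3 and 4. Holding the blacklist in a set keeps each lookup cheap even with about 10000 entries.

```python
def rank_pages(pages, search_terms, required_substring="", blacklist=frozenset()):
    """Filter and rank crawled pages given as (url, level, page_text) tuples."""
    results = []
    for url, level, page_text in pages:
        if required_substring not in url:
            continue                                   # Rule 1: URL must contain the substring
        lowered = page_text.lower()
        counts = {term: lowered.count(term.lower()) for term in search_terms}
        distinct = sum(1 for n in counts.values() if n > 0)
        if distinct == 0:
            continue                                   # no search string found on this page
        results.append({
            "url": url,
            "level": level,
            "distinct": distinct,                      # Rule 3: number of different strings found
            "total": sum(counts.values()),             # Rule 3: occurrences across all strings
            "blacklisted": url in blacklist,           # Rule 2: flag, do not drop
        })
    # Rule 3 first, then Rule 4: pages closer to the seed win ties.
    results.sort(key=lambda r: (-r["distinct"], -r["total"], r["level"]))
    return results
```

Blacklisted URLs stay in the output with a flag, since Rule 2 asks for them to be highlighted rather than removed.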
