|
| 1 | +A programming exercise: a web crawler written in Python. |
| 2 | + |
| 3 | +TODO |
| 4 | +don't process a url that has already been processed. |
| 5 | +the complex part. |
| 6 | +figure out how to do tests first. |
| 7 | +error handling. |
| 8 | + |
| 9 | +Thoughts |
| 10 | +It took 2 hours. |
| 11 | +I have never seen python before. |
| 12 | +python seems to make sense. |
| 13 | +finding the correct regex method was tricky. |
| 14 | +casting numbers to strings was annoying. |
| 15 | + |
| 16 | +Problem statement: |
| 17 | +Language: Python |
| 18 | +Official site: http://python.org/ |
| 19 | +Beginner's guide: http://wiki.python.org/moin/BeginnersGuide |
| 20 | +Tutorial: http://docs.python.org/tutorial/ |
| 21 | + |
| 22 | +App Category: Networking |
| 23 | +DB: None |
| 24 | +Simple part: |
| 25 | +Create a web crawler app in Python which, given a seed url, crawls |
| 26 | +through all links on the page and follows them down to a given level of |
| 27 | +depth. While crawling, the app should collect the url of every page |
| 28 | +containing a specific search text. |
| 29 | +Input: |
| 30 | +1 - Url seed e.g. www.hackernews.com |
| 31 | +2 - Depth e.g. 5 (this means follow links on a page up to 5 levels deep) |
| 32 | +3 - search text e.g. "python" |
| 33 | +Output: |
| 34 | +the list of urls whose pages contain the specified text |
| 35 | +The simple part is mandatory. |
| 36 | +If you finish the simple part and are eager to take up something |
| 37 | +challenging, then here's a little complex angle to the problem: |
| 38 | +Complex part: |
| 39 | +Write rules around the app for searching. |
| 40 | +Rule 1: The returned url should contain a specific substring |
| 41 | +Rule 2: Highlight in output if the url is amongst a long list of |
| 42 | +blacklisted urls (about 10000 blacklisted urls) |
| 43 | +Rule 3: Search for multiple search strings and rank urls by the |
| 44 | +number of different search strings found and the occurrences of each search |
| 45 | +string in the page |
| 46 | +Rule 4: Rank as per level of the Url w.r.t. seed url |
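The simple part, together with the TODO item about not processing a url twice, could be sketched roughly as follows. This is not the code from this commit; it is a minimal illustration using only the Python standard library. The names `LinkParser`, `fetch_page` and `crawl` are hypothetical, and the sketch assumes that depth 0 means "scan only the seed page", that matching is case-insensitive, and that the seed url includes its scheme (`http://` or `https://`).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Hypothetical default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, depth, search_text, fetch=fetch_page):
    """Return urls within `depth` link levels of `seed` whose page contains `search_text`."""
    visited = {seed}            # urls already queued, so none is processed twice
    queue = deque([(seed, 0)])  # breadth-first: (url, level relative to the seed)
    matches = []
    needle = search_text.lower()
    while queue:
        url, level = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # basic error handling: skip pages that fail to load
        if needle in html.lower():
            matches.append(url)
        if level >= depth:
            continue  # depth limit reached, do not follow links from this page
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in visited:
                visited.add(link)
                queue.append((link, level + 1))
    return matches
```

A call such as `crawl("http://www.hackernews.com", 5, "python")` would then return the matching urls in the order they were reached. Passing the fetcher in as a parameter is a design choice made here so that the crawl logic can be tested against canned pages without touching the network, which may also help with the "figure out how to do tests first" item.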