Question:
How do I build a Web crawler aka Web scraper for data mining?
Mesmer
2006-12-17 10:10:34 UTC
I am a beginner (newb) and need to know all the steps to do this. My goal is to mine 100 sites for information on a daily basis. Do I need a server? Can I do it from my home computer? Years ago I took a Perl course, but should I use Python? Please point me to a resource other than Google, which I have already spent hours searching. The answers I have found so far are over my head. I even got the O'Reilly book Spidering Hacks. I need the basics of where to start. Thanks!
Four answers:
Mation
2006-12-17 10:35:45 UTC
One article that I found useful begins with very simple examples and then builds on them. The article is Linux-centric and the examples are written in Ruby, but you should be able to pick up some useful information.



There are also some links at the end to point you in the direction of other useful pages.
2016-03-13 11:55:49 UTC
I have some experience with this, so I'll give you some pointers. Be careful with repeated web requests: sent too quickly, they can amount to a denial-of-service (DoS) attack and knock sites offline. Be sure to introduce some artificial time delays so that you don't send too much traffic to a website too fast. Don't try to bypass a CAPTCHA, as you might be in violation of the law if you do so. As stated above, don't claim the work is yours. Other than that, no, it's not illegal if the information is in the public domain.
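To make the point about artificial time delays concrete, here is a minimal Python sketch using only the standard library. The names fetch and crawl_politely, the User-Agent string, and the 5-second default delay are all assumptions made up for illustration; they do not come from any particular site's rules, and a real crawler would tune the delay per site.

```python
import time
import urllib.request
from urllib.parse import urlparse


def fetch(url, timeout=10):
    """Download one page and return its body as text (hypothetical helper)."""
    # The User-Agent value is an invented example; use something that identifies you.
    request = urllib.request.Request(url, headers={"User-Agent": "my-hobby-crawler/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_politely(urls, fetch_page=fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing before any repeat request to a host already visited."""
    pages = {}
    seen_hosts = set()
    for url in urls:
        host = urlparse(url).netloc
        if host in seen_hosts:
            # Artificial delay so one site never gets a burst of requests.
            sleep(delay_seconds)
        seen_hosts.add(host)
        try:
            pages[url] = fetch_page(url)
        except OSError as error:
            # urllib's URLError and HTTPError are both subclasses of OSError.
            print(f"skipping {url}: {error}")
            pages[url] = None
    return pages
```

Calling crawl_politely(["http://example.com/a", "http://example.com/b"]) would fetch the first page immediately and wait delay_seconds before the second, because both URLs share a host. The fetch_page and sleep parameters exist only so the loop can be tried out with stand-ins before pointing it at a real site.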
csanon
2006-12-17 10:23:27 UTC
You need a language with networking capabilities. That leaves a lot of choices: C, C++, C#, VB.NET, Perl, Python, Ruby, etc.



If the answers you find are over your head, you don't have the basics down. Learn to walk before you try the 400-meter sprint.



So take a language of your choice. If you know Perl, work with that and polish it. I use Python and C++; others would use Java or C#. *Pick one*, *learn it*. Then the answers won't be over your head.
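For a sense of how little code the basics take once you know a language, here is a rough Python sketch of the parsing half of a crawler: pulling the links out of a page you have already downloaded. It uses only the standard library. LinkCollector and extract_links are names invented for this example, and the sample HTML and URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page's URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return all link targets found in html_text, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links


sample = '<p><a href="/about">About</a> <a href="http://other.example/x">X</a></p>'
print(extract_links(sample, "http://site.example/index.html"))
# prints ['http://site.example/about', 'http://other.example/x']
```

A crawler is essentially this step in a loop: download a page, extract its links, and queue the ones you care about. The same idea carries over to whichever language you pick.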
jake cigar™ is retired
2006-12-17 13:47:13 UTC
You want basic, you want to learn, and you know Perl... CPAN has many modules for HTTP access and HTML parsing. LWP is just the beginning!



check out http://cpan.org



Mmmm perl!


This content was originally posted on Y! Answers, a Q&A website that shut down in 2021.