[Hplusroadmap] Fwd: Internet crawl: CrawlDb getting big!
Bryan Bishop
kanzure at gmail.com
Tue May 6 20:06:11 CDT 2008
This goes back to the previous discussions on spiders and crawls over
the web. Re: the recent semantic web issues, spidering the aggregation
files around the web will eventually be needed, and nutch is a good
place to start. Has anyone heard from the search wikia people
recently? Jimmy kind of dropped off of that particular map when people
started to mention that an NPOV search engine just doesn't make (niche)
sense.
http://lucene.apache.org
List of crawlbots:
http://heybryan.org/bookmarks/bookmarks-old2//Computation%20and%20the%20WWW/Search/index.html#9.10
- Bryan
---------- Forwarded message ----------
From: Mathijs Homminga <mathijs.homminga at knowlogy.nl>
Date: Tue, May 6, 2008 at 5:28 PM
Subject: Internet crawl: CrawlDb getting big!
To: nutch-dev at lucene.apache.org
Hi all,
The time needed to do a generate and an updatedb depends linearly on
the size of the CrawlDb.
Our CrawlDb currently contains about 1.5 billion urls (some fetched,
but most of them unfetched).
We are using Nutch 0.9 on a 15-node cluster. These are the times
needed for these jobs:
generate: 8-10 hours
updatedb: 8-10 hours
Our fetch job takes about 30 hours, in which we fetch and parse about
8 million docs (limited by our current bandwidth).
So, we spend about 40% of our time on CrawlDb administration.
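As a rough check of that figure, the job times above can simply be added up. This is only a back-of-the-envelope sketch: the function name admin_fraction is made up for illustration, and it assumes the three jobs run strictly one after another with no overlap.

```python
def admin_fraction(generate_h, updatedb_h, fetch_h):
    """Fraction of one crawl cycle spent on CrawlDb jobs (generate + updatedb),
    assuming the jobs run sequentially with no overlap."""
    admin_h = generate_h + updatedb_h
    return admin_h / (admin_h + fetch_h)

# Low and high ends of the reported 8-10 hour ranges, with a 30-hour fetch.
low = admin_fraction(8, 8, 30)
high = admin_fraction(10, 10, 30)
print(f"{low:.0%} to {high:.0%}")  # prints "35% to 40%"
```

So the "about 40%" corresponds to the upper end of the reported ranges.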
The first problem for us was that we didn't make the best use of our
bandwidth (40% of the time no fetching). We solved this by designing a
system which looks a bit like the FetchCycleOverlap
(http://wiki.apache.org/nutch/FetchCycleOverlap) recently suggested by
Otis.
Another problem is that as the CrawlDb grows, the admin time
increases. One way to solve this is by increasing the topN each time
so the ratio between admin jobs and the fetch job remains constant.
However, we will end up with extremely long cycles and large segments.
Some of this we solved by generating multiple segments in one generate
job and only performing an updatedb when (almost) all of these segments
are fetched.
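The trade-off between topN and cycle length can be sketched with a simple linear model. The constants are derived from the numbers quoted above (taking 18 hours as the midpoint for generate plus updatedb). The model itself and the names top_n_for_ratio and cycle_hours are assumptions for illustration, not anything taken from Nutch.

```python
# Rates derived from the figures above; the linear model is an assumption.
ADMIN_HOURS_PER_URL = 18.0 / 1.5e9   # generate + updatedb (~18 h) over 1.5 billion urls
FETCH_HOURS_PER_DOC = 30.0 / 8e6     # 30 h to fetch and parse 8 million docs

def top_n_for_ratio(crawldb_size, admin_ratio):
    """topN at which admin time makes up `admin_ratio` of the whole cycle."""
    admin_h = ADMIN_HOURS_PER_URL * crawldb_size
    fetch_h = admin_h * (1 - admin_ratio) / admin_ratio
    return round(fetch_h / FETCH_HOURS_PER_DOC)

def cycle_hours(crawldb_size, top_n):
    """Total length of one generate / fetch / updatedb cycle in hours."""
    return ADMIN_HOURS_PER_URL * crawldb_size + FETCH_HOURS_PER_DOC * top_n

# Keeping today's ratio (18 h admin out of 48 h) while the CrawlDb doubles:
ratio = 18.0 / 48.0
for size in (1.5e9, 3.0e9):
    n = top_n_for_ratio(size, ratio)
    print(f"{size:.1e} urls -> topN {n}, cycle {cycle_hours(size, n):.0f} h")
```

Under these assumptions, doubling the CrawlDb at a constant ratio doubles both topN and the cycle length, which is the "extremely long cycles and large segments" problem.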
But still, the number of urls we select (generate) and the number of
urls we update (updatedb) are very small compared to the size of the
CrawlDb. We were wondering if there is a way to avoid reading in the
whole CrawlDb each time.
How about putting the CrawlDb in HBase? Sorting (generate) might
become a problem then...
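To illustrate why selection touches every entry: picking the topN best-scoring unfetched urls needs a pass over the whole set, however small topN is, unless the store keeps entries ordered by score. The snippet below is a single-machine toy, not Nutch's actual Generator. The dictionary layout, the status strings, and the example urls are all made up.

```python
import heapq

def generate(crawldb, top_n):
    """Return the top_n unfetched urls by score, scanning every entry once."""
    unfetched = ((score, url) for url, (status, score) in crawldb.items()
                 if status == "unfetched")
    # nlargest keeps only top_n candidates in memory, but still reads all entries.
    return [url for score, url in heapq.nlargest(top_n, unfetched)]

# Hypothetical CrawlDb: url -> (status, score)
crawldb = {
    "http://a.example/": ("fetched", 0.9),
    "http://b.example/": ("unfetched", 0.2),
    "http://c.example/": ("unfetched", 0.7),
    "http://d.example/": ("unfetched", 0.5),
}
print(generate(crawldb, 2))  # ['http://c.example/', 'http://d.example/']
```

A store keyed by url gives fast point updates for updatedb, but by itself does not remove this scan.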
Is this issue addressed in the Nutch2Architecture?
I'm happy to spend some more time on this, so all ideas are welcome.
Thanks,
Mathijs Homminga
--
Knowlogy
Helperpark 290 C
9723 ZA Groningen
The Netherlands
+31 (0)50 2103567
http://www.knowlogy.nl
mathijs.homminga at knowlogy.nl
+31 (0)6 15312977