Abot

Please star this project!

C# web crawler built for speed and flexibility.

Abot is an open source C# web crawler framework built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug in your own implementations of core interfaces to take complete control over the crawl process. Abot NuGet package versions >= 2.0 target Dotnet Standard 2.0, and versions < 2.0 target .NET Framework 4.0, which makes it highly compatible with many .NET Framework/Core implementations.
What's So Great About It?
- Open Source (free for commercial and personal use)
- It's fast, really fast!!
- Easily customizable (pluggable architecture allows you to decide what gets crawled and how)
- Heavily unit tested (high code coverage)
- Very lightweight (not over engineered)
- No out-of-process dependencies (no databases, no installed services, etc.)
Links of Interest
- Ask a question (please search for similar questions first!!!)
- Report a bug
- Learn how you can contribute
- Need expert Abot customization?
- Take the usage survey to help prioritize features/improvements
- Consider making a donation
Use AbotX for more powerful extensions/wrappers
- Crawl multiple sites concurrently
- Execute/render JavaScript
- Avoid getting blocked by sites
- Auto tuning
- Auto throttling
- Pause/resume live crawls
- Simplified pluggability/extensibility
Quick Start

Installing Abot
- Install Abot using NuGet
```
PM> Install-Package Abot
```
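If you prefer the .NET CLI, the same package can be added to a project with:

```
dotnet add package Abot
```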
Using Abot
```c#
using System;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Crawler;
using Abot2.Poco;
using Serilog;

namespace TestAbotUse
{
    class Program
    {
        static async Task Main(string[] args)
        {
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Information()
                .WriteTo.Console()
                .CreateLogger();

            Log.Logger.Information("Demo starting up!");

            await DemoSimpleCrawler();
            await DemoSinglePageRequest();
        }

        private static async Task DemoSimpleCrawler()
        {
            var config = new CrawlConfiguration
            {
                MaxPagesToCrawl = 10, //Only crawl 10 pages
                MinCrawlDelayPerDomainMilliSeconds = 3000 //Wait this many millisecs between requests
            };
            var crawler = new PoliteWebCrawler(config);
            crawler.PageCrawlCompleted += PageCrawlCompleted; //Several events available...

            var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));
        }

        private static async Task DemoSinglePageRequest()
        {
            var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

            var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
            Log.Logger.Information("{result}", new
            {
                url = crawledPage.Uri,
                status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
            });
        }

        private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
        {
            var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
            var rawPageText = e.CrawledPage.Content.Text;
        }
    }
}
```
Abot Configuration

Abot's Abot2.Poco.CrawlConfiguration class has a ton of configuration options. You can see what effect each config value has on the crawl by reading the code comments.
```c#
var crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
//etc...
```
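The ConfigurationExtensions dictionary is the place to stash your own custom settings. As a minimal sketch of reading one back inside an event handler (assuming the crawl context exposes the active configuration as e.CrawlContext.CrawlConfiguration):

```c#
crawler.PageCrawlCompleted += (sender, e) =>
{
    //Read a custom value back out of the active configuration
    var customValue = e.CrawlContext.CrawlConfiguration
        .ConfigurationExtensions["SomeCustomConfigValue1"];
    Console.WriteLine($"Custom config value: {customValue}");
};
```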
Abot Events

Register for events and create processing methods.
```c#
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;
```
```c#
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine($"About to crawl link {pageToCrawl.Uri.AbsoluteUri} which was found on page {pageToCrawl.ParentUri.AbsoluteUri}");
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.HttpRequestException != null || crawledPage.HttpResponseMessage.StatusCode != HttpStatusCode.OK)
        Console.WriteLine($"Crawl of page failed {crawledPage.Uri.AbsoluteUri}");
    else
        Console.WriteLine($"Crawl of page succeeded {crawledPage.Uri.AbsoluteUri}");

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine($"Page had no content {crawledPage.Uri.AbsoluteUri}");

    var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine($"Did not crawl the links on page {crawledPage.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine($"Did not crawl page {pageToCrawl.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}
```
Custom objects and the dynamic crawl bag

Add any number of custom objects to the dynamic crawl bag or page bag. These objects will be available in the CrawlContext.CrawlBag object, PageToCrawl.PageBag object, or CrawledPage.PageBag object.
```c#
var crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
//...
```
```c#
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    //Get your Foo instances from the CrawlContext object
    var foo1 = e.CrawlContext.CrawlBag.MyFoo1;
    var foo2 = e.CrawlContext.CrawlBag.MyFoo2;

    //Also add a dynamic value to the PageToCrawl or CrawledPage
    e.PageToCrawl.PageBag.Bar = new Bar();
}
```
Cancellation
```c#
CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
var crawler = new PoliteWebCrawler();
var result = await crawler.CrawlAsync(new Uri("addurihere"), cancellationTokenSource);
```
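Because CrawlAsync takes the token source itself, you can trigger cancellation from anywhere. For example, a sketch that aborts a long-running crawl after a fixed timeout using the standard CancellationTokenSource.CancelAfter API:

```c#
var cancellationTokenSource = new CancellationTokenSource();

//Cancel the crawl if it is still running after 5 minutes
cancellationTokenSource.CancelAfter(TimeSpan.FromMinutes(5));

var crawler = new PoliteWebCrawler();
var result = await crawler.CrawlAsync(new Uri("addurihere"), cancellationTokenSource);
```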
Customizing Crawl Behavior

Abot was designed to be as pluggable as possible. This allows you to easily alter the way it works to suit your needs.

The easiest way to change Abot's behavior for common features is to change the config values that control them. See the Quick Start section above for examples of the different ways Abot can be configured.
CrawlDecision Callbacks/Delegates

Sometimes you don't want to create a class and go through the ceremony of extending a base class or implementing an interface directly. For all you lazy developers out there, Abot provides a shorthand method to easily add your custom crawl decision logic. NOTE: The ICrawlDecisionMaker's corresponding method is called first, and if it does not "allow" a decision, these callbacks will not be called.
```c#
var crawler = new PoliteWebCrawler();

crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) =>
{
    var decision = new CrawlDecision { Allow = true };
    if (pageToCrawl.Uri.Authority == "google.com")
        return new CrawlDecision { Allow = false, Reason = "Dont want to crawl google pages" };

    return decision;
};

crawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
{
    var decision = new CrawlDecision { Allow = true };
    if (!crawledPage.Uri.AbsoluteUri.Contains(".com"))
        return new CrawlDecision { Allow = false, Reason = "Only download raw page content for .com tlds" };

    return decision;
};

crawler.ShouldCrawlPageLinksDecisionMaker = (crawledPage, crawlContext) =>
{
    var decision = new CrawlDecision { Allow = true };
    if (crawledPage.Content.Bytes.Length < 100)
        return new CrawlDecision { Allow = false, Reason = "Just crawl links in pages that have at least 100 bytes" };

    return decision;
};
```
Custom Implementations

PoliteWebCrawler is the master of orchestrating the crawl. Its job is to coordinate all the utility classes to "crawl" a site. PoliteWebCrawler accepts an alternate implementation for all of its dependencies through its constructor.
```c#
var crawler = new PoliteWebCrawler(
    new CrawlConfiguration(),
    new YourCrawlDecisionMaker(),
    new YourThreadMgr(),
    new YourScheduler(),
    new YourPageRequester(),
    new YourHyperLinkParser(),
    new YourMemoryManager(),
    new YourDomainRateLimiter(),
    new YourRobotsDotTextFinder());
```
Passing null for any implementation will use the default. The example below will use your custom implementations of IPageRequester and IHyperLinkParser but will use the defaults for everything else.
```c#
var crawler = new PoliteWebCrawler(
    null,
    null,
    null,
    null,
    new YourPageRequester(),
    new YourHyperLinkParser(),
    null,
    null,
    null);
```
Below is an explanation of each interface that PoliteWebCrawler depends on to do the real work.
ICrawlDecisionMaker

The callback/delegate shortcuts are great for adding a small amount of logic, but if you are doing anything heavier you will want to pass in your own implementation of ICrawlDecisionMaker. The crawler calls this implementation to see whether a page should be crawled, whether the page's content should be downloaded, and whether the page's links should be crawled.

CrawlDecisionMaker.cs is the default ICrawlDecisionMaker used by Abot. This class takes care of common checks like making sure the config value MaxPagesToCrawl is not exceeded. Most users will only need to create a class that extends CrawlDecisionMaker and adds their custom logic. However, you are completely free to create a class that implements ICrawlDecisionMaker and pass it into PoliteWebCrawler's constructor.
```c#
/// <summary>
/// Determines what pages should be crawled, whether the raw content should be downloaded and if the links on a page should be crawled
/// </summary>
public interface ICrawlDecisionMaker
{
    /// <summary>
    /// Decides whether the page should be crawled
    /// </summary>
    CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext);

    /// <summary>
    /// Decides whether the page's links should be crawled
    /// </summary>
    CrawlDecision ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext);

    /// <summary>
    /// Decides whether the page's content should be downloaded
    /// </summary>
    CrawlDecision ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext);
}
```
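As a sketch of the "extend the default" approach, here is a custom decision maker that layers one extra rule on top of the built-in checks. It assumes CrawlDecisionMaker's methods are virtual (verify against the current source); the class name and the .pdf rule are illustrative only:

```c#
using System;
using Abot2.Core;
using Abot2.Poco;

public class NoPdfCrawlDecisionMaker : CrawlDecisionMaker
{
    public override CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext)
    {
        //Run Abot's default checks (MaxPagesToCrawl, crawl depth, etc.) first
        var decision = base.ShouldCrawlPage(pageToCrawl, crawlContext);
        if (!decision.Allow)
            return decision;

        //Then layer custom logic on top (illustrative rule)
        if (pageToCrawl.Uri.AbsoluteUri.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            return new CrawlDecision { Allow = false, Reason = "Skipping pdf links" };

        return decision;
    }
}
```

An instance can then be passed as the ICrawlDecisionMaker constructor argument shown above, leaving the remaining dependencies null.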
IThreadManager

The IThreadManager interface handles the multithreading details. It is used by the crawler to manage concurrent HTTP requests.

TaskThreadManager.cs is the default IThreadManager used by Abot.
```c#
/// <summary>
/// Handles the multithreading implementation details
/// </summary>
public interface IThreadManager : IDisposable
{
    /// <summary>
    /// Max number of threads to use
    /// </summary>
    int MaxThreads { get; }

    /// <summary>
    /// Will perform the action asynchronously on a separate thread
    /// </summary>
    /// <param name="action">The action to perform</param>
    void DoWork(Action action);

    /// <summary>
    /// Whether there are running threads
    /// </summary>
    bool HasRunningThreads();

    /// <summary>
    /// Abort all running threads
    /// </summary>
    void AbortAll();
}
```
IScheduler

The IScheduler interface deals with managing what pages need to be crawled. The crawler gives the links it finds to, and gets the pages to crawl from, the IScheduler implementation. A common use case for writing your own implementation might be to distribute crawls across multiple machines, managed by something like a DistributedScheduler.

Scheduler.cs is the default IScheduler used by the crawler. By default it is constructed with in-memory collections to determine which pages have been crawled and which still need to be crawled.
```c#
/// <summary>
/// Handles managing the priority of what pages need to be crawled
/// </summary>
public interface IScheduler
{
    /// <summary>
    /// Count of remaining items that are currently scheduled
    /// </summary>
    int Count { get; }

    /// <summary>
    /// Schedules the param to be crawled
    /// </summary>
    void Add(PageToCrawl page);

    /// <summary>
    /// Schedules the param to be crawled
    /// </summary>
    void Add(IEnumerable<PageToCrawl> pages);

    /// <summary>
    /// Gets the next page to crawl
    /// </summary>
    PageToCrawl GetNext();

    /// <summary>
    /// Clear all currently scheduled pages
    /// </summary>
    void Clear();
}
```
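To get a feel for the contract, here is a minimal in-memory sketch implementing exactly the members shown above. Treat it as illustrative rather than drop-in: the real interface may declare additional members, and a production scheduler would need thread-safe collections since the crawler is concurrent.

```c#
using System.Collections.Generic;
using Abot2.Poco;

public class SimpleFifoScheduler : IScheduler
{
    private readonly Queue<PageToCrawl> _pagesToCrawl = new Queue<PageToCrawl>();
    private readonly HashSet<string> _scheduledUris = new HashSet<string>();

    public int Count => _pagesToCrawl.Count;

    public void Add(PageToCrawl page)
    {
        //Only schedule each uri once
        if (_scheduledUris.Add(page.Uri.AbsoluteUri))
            _pagesToCrawl.Enqueue(page);
    }

    public void Add(IEnumerable<PageToCrawl> pages)
    {
        foreach (var page in pages)
            Add(page);
    }

    public PageToCrawl GetNext() =>
        _pagesToCrawl.Count > 0 ? _pagesToCrawl.Dequeue() : null;

    public void Clear() => _pagesToCrawl.Clear();
}
```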
IPageRequester

The IPageRequester interface deals with making the raw HTTP requests.

PageRequester.cs is the default IPageRequester used by the crawler.
```c#
public interface IPageRequester : IDisposable
{
    /// <summary>
    /// Make an http web request to the url and download its content
    /// </summary>
    Task<CrawledPage> MakeRequestAsync(Uri uri);

    /// <summary>
    /// Make an http web request to the url and download its content based on the param func decision
    /// </summary>
    Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent);
}
```
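A low-risk way to customize requests is to decorate the default implementation rather than rewrite it. A sketch of a logging decorator, assuming the interface is exactly as shown above:

```c#
using System;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Poco;
using Serilog;

public class LoggingPageRequester : IPageRequester
{
    private readonly IPageRequester _inner;

    public LoggingPageRequester(IPageRequester inner) => _inner = inner;

    public async Task<CrawledPage> MakeRequestAsync(Uri uri)
    {
        Log.Logger.Information("Requesting {uri}", uri);
        var crawledPage = await _inner.MakeRequestAsync(uri);
        Log.Logger.Information("Got {status} for {uri}",
            crawledPage.HttpResponseMessage?.StatusCode, uri);
        return crawledPage;
    }

    public Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent) =>
        _inner.MakeRequestAsync(uri, shouldDownloadContent);

    public void Dispose() => _inner.Dispose();
}
```

An instance wrapping new PageRequester(config, new WebContentExtractor()) can then be passed as the IPageRequester constructor argument.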
IHyperLinkParser

The IHyperLinkParser interface deals with parsing the links out of raw HTML.

AngleSharpHyperlinkParser.cs is the default IHyperLinkParser used by the crawler. It uses the well-known AngleSharp library for HTML parsing. AngleSharp supports CSS-style selectors, like jQuery, but all in C#.
```c#
/// <summary>
/// Handles parsing hyperlinks out of the raw html
/// </summary>
public interface IHyperLinkParser
{
    /// <summary>
    /// Parses html to extract hyperlinks, converts each into an absolute url
    /// </summary>
    IEnumerable<Uri> GetLinks(CrawledPage crawledPage);
}
```
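Since each CrawledPage already exposes the parsed AngleSharp document (see crawledPage.AngleSharpHtmlDocument in the events example above), you can also run CSS-style selectors yourself. A small sketch inside a PageCrawlCompleted handler:

```c#
crawler.PageCrawlCompleted += (sender, e) =>
{
    var document = e.CrawledPage.AngleSharpHtmlDocument;
    if (document == null)
        return;

    //Query with a css selector, jQuery style
    foreach (var anchor in document.QuerySelectorAll("a[href]"))
        Console.WriteLine(anchor.GetAttribute("href"));
};
```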
IMemoryManager

The IMemoryManager handles memory monitoring. This feature is still experimental and could be removed in a future release if found to be unreliable.

MemoryManager.cs is the default implementation used by the crawler.
```c#
/// <summary>
/// Handles memory monitoring/usage
/// </summary>
public interface IMemoryManager : IMemoryMonitor, IDisposable
{
    /// <summary>
    /// Whether the current process that is hosting this instance is allocated/using above the param value of memory in mb
    /// </summary>
    bool IsCurrentUsageAbove(int sizeInMb);

    /// <summary>
    /// Whether there is at least the param value of available memory in mb
    /// </summary>
    bool IsSpaceAvailable(int sizeInMb);
}
```
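The memory manager is typically driven through configuration rather than used directly. A sketch, assuming the MaxMemoryUsageInMb and MinAvailableMemoryRequiredInMb config values behave as their names suggest (verify against the CrawlConfiguration code comments):

```c#
var config = new CrawlConfiguration
{
    //Abort the crawl if this process uses more than 200mb of memory (assumed semantics)
    MaxMemoryUsageInMb = 200,

    //Abort the crawl if less than 100mb of system memory is available (assumed semantics)
    MinAvailableMemoryRequiredInMb = 100
};
var crawler = new PoliteWebCrawler(config);
```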
IDomainRateLimiter

The IDomainRateLimiter handles domain rate limiting. It determines how much time must elapse before it is ok to make another HTTP request to a given domain.

DomainRateLimiter.cs is the default implementation used by the crawler.
```c#
/// <summary>
/// Rate limits or throttles on a per domain basis
/// </summary>
public interface IDomainRateLimiter
{
    /// <summary>
    /// If the domain of the param has been flagged for rate limiting, it will be rate limited according to the configured minimum crawl delay
    /// </summary>
    void RateLimit(Uri uri);

    /// <summary>
    /// Add a domain entry so that domain may be rate limited according to the param minimum crawl delay
    /// </summary>
    void AddDomain(Uri uri, long minCrawlDelayInMillisecs);
}
```
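For example, you could give one domain a longer delay than the rest of the crawl. A sketch, assuming DomainRateLimiter's constructor takes the default minimum delay in milliseconds (an assumption; check the actual signature):

```c#
//Default minimum delay of 1 second per domain (assumed constructor parameter)
var rateLimiter = new DomainRateLimiter(1000);

//Be extra polite to this one domain: at most one request every 5 seconds
rateLimiter.AddDomain(new Uri("http://example.com"), 5000);

//Pass the instance in as the IDomainRateLimiter argument of PoliteWebCrawler's constructor
```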
IRobotsDotTextFinder

The IRobotsDotTextFinder is responsible for retrieving the robots.txt file for every domain (if IsRespectRobotsDotTextEnabled is true) and building the robots.txt abstraction, which implements the IRobotsDotText interface.

RobotsDotTextFinder.cs is the default implementation used by the crawler.
```c#
/// <summary>
/// Finds and builds the robots.txt file abstraction
/// </summary>
public interface IRobotsDotTextFinder
{
    /// <summary>
    /// Finds the robots.txt file for the given root uri
    /// </summary>
    IRobotsDotText Find(Uri rootUri);
}
```
