abot

Please star this project!

A C# web crawler built for speed and flexibility.

Abot is an open source C# web crawler framework built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug in your own implementations of core interfaces to take complete control over the crawl process. Abot NuGet package versions >= 2.0 target .NET Standard 2.0, and Abot NuGet package versions < 2.0 target .NET Framework 4.0, which makes it highly compatible with many .NET Framework/Core implementations.

What's So Great About It?
  • Open source (free for commercial and personal use)
  • It's fast, really fast!!
  • Easily customizable (pluggable architecture lets you decide what gets crawled and how)
  • Heavily unit tested (high code coverage)
  • Very lightweight (not over-engineered)
  • No out-of-process dependencies (no databases, no installed services, etc.)
Links of Interest
  • Ask a question (please search for similar questions first!)
  • Report a bug
  • Learn how you can contribute
  • Need expert Abot customization?
  • Take the usage survey to help prioritize features/improvements
  • Consider making a donation
Use AbotX for more powerful extensions/wrappers
  • Crawl multiple sites concurrently
  • Execute/render JavaScript
  • Avoid getting blocked by sites
  • Auto tuning
  • Auto throttling
  • Pause/resume live crawls
  • Simplified pluggability/extensibility


Quick Start

Installing Abot
  • Install Abot using NuGet
PM> Install-Package Abot
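If you prefer the .NET CLI to the Package Manager console, the equivalent command is:

dotnet add package Abot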
Using Abot

using System;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Crawler;
using Abot2.Poco;
using Serilog;

namespace TestAbotUse
{
    class Program
    {
        static async Task Main(string[] args)
        {
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Information()
                .WriteTo.Console()
                .CreateLogger();

            Log.Logger.Information("Demo starting up!");

            await DemoSimpleCrawler();
            await DemoSinglePageRequest();
        }

        private static async Task DemoSimpleCrawler()
        {
            var config = new CrawlConfiguration
            {
                MaxPagesToCrawl = 10, //Only crawl 10 pages
                MinCrawlDelayPerDomainMilliSeconds = 3000 //Wait this many millisecs between requests
            };
            var crawler = new PoliteWebCrawler(config);

            crawler.PageCrawlCompleted += PageCrawlCompleted; //Several events available...

            var crawlResult = await crawler.CrawlAsync(new Uri("http://YOURSITEHERE.com"));
        }

        private static async Task DemoSinglePageRequest()
        {
            var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

            var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
            Log.Logger.Information("{result}", new
            {
                url = crawledPage.Uri,
                status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
            });
        }

        private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
        {
            var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
            var rawPageText = e.CrawledPage.Content.Text;
        }
    }
}
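Note: the demo above logs through Serilog, so compiling it as-is also requires the Serilog and Serilog.Sinks.Console NuGet packages (the latter provides the WriteTo.Console() sink).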

Abot Configuration

Abot's Abot2.Poco.CrawlConfiguration class has a ton of configuration options. You can see what effect each config value has on the crawl by reading the code comments on that class.

var crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
etc...
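For example, a crawl that is polite to the sites it visits might combine the per-domain delay with robots.txt support. This is a minimal sketch; both property names appear elsewhere in this document, but check the class comments for the exact semantics in your Abot version.

var politeConfig = new CrawlConfiguration();
politeConfig.IsRespectRobotsDotTextEnabled = true; //honor each domain's robots.txt
politeConfig.MinCrawlDelayPerDomainMilliSeconds = 1000; //wait at least 1 second between requests to a domain
politeConfig.MaxPagesToCrawl = 100; //keep the crawl small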

Abot Events

Register for events and create processing methods:

crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
crawler.PageCrawlCompleted += crawler_ProcessPageCrawlCompleted;
crawler.PageCrawlDisallowed += crawler_PageCrawlDisallowed;
crawler.PageLinksCrawlDisallowed += crawler_PageLinksCrawlDisallowed;

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
	PageToCrawl pageToCrawl = e.PageToCrawl;
	Console.WriteLine($"About to crawl link {pageToCrawl.Uri.AbsoluteUri} which was found on page {pageToCrawl.ParentUri.AbsoluteUri}");
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
	CrawledPage crawledPage = e.CrawledPage;
	if (crawledPage.HttpRequestException != null || crawledPage.HttpResponseMessage.StatusCode != HttpStatusCode.OK)
		Console.WriteLine($"Crawl of page failed {crawledPage.Uri.AbsoluteUri}");
	else
		Console.WriteLine($"Crawl of page succeeded {crawledPage.Uri.AbsoluteUri}");

	if (string.IsNullOrEmpty(crawledPage.Content.Text))
		Console.WriteLine($"Page had no content {crawledPage.Uri.AbsoluteUri}");

	var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; //AngleSharp parser
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
	CrawledPage crawledPage = e.CrawledPage;
	Console.WriteLine($"Did not crawl the links on page {crawledPage.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
	PageToCrawl pageToCrawl = e.PageToCrawl;
	Console.WriteLine($"Did not crawl page {pageToCrawl.Uri.AbsoluteUri} due to {e.DisallowedReason}");
}

Custom Objects and the Dynamic Crawl Bag

Add any number of custom objects to the dynamic crawl bag or page bag. These objects will be available in the CrawlContext.CrawlBag object, the PageToCrawl.PageBag object, or the CrawledPage.PageBag object.

var crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStarting += crawler_ProcessPageCrawlStarting;
...

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    //Get your Foo instances from the CrawlContext object
    var foo1 = e.CrawlContext.CrawlBag.MyFoo1;
    var foo2 = e.CrawlContext.CrawlBag.MyFoo2;

    //Also add a dynamic value to the PageToCrawl or CrawledPage
    e.PageToCrawl.PageBag.Bar = new Bar();
}

Cancellation

CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();

var crawler = new PoliteWebCrawler();
var result = await crawler.CrawlAsync(new Uri("addurihere"), cancellationTokenSource);
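For example, to abandon a crawl that runs past a fixed time budget, arm the token source before starting the crawl (CancelAfter is part of the standard .NET CancellationTokenSource API):

cancellationTokenSource.CancelAfter(TimeSpan.FromMinutes(5)); //request cancellation after 5 minutes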


Customizing Crawl Behavior

Abot was designed to be as pluggable as possible. This allows you to easily alter the way it works to suit your needs.

The easiest way to change Abot's behavior is to change the config values that control it. See the quick start section above for examples of the different ways Abot can be configured.

CrawlDecision Callbacks/Delegates

Sometimes you don't want to create a class and go through the ceremony of extending a base class or directly implementing an interface. For all you lazy developers out there, Abot provides a shorthand way to easily add your custom crawl decision logic. NOTE: The ICrawlDecisionMaker's corresponding method is called first; if it does not allow the decision, these callbacks will not be called.

var crawler = new PoliteWebCrawler();

crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) =>
{
	var decision = new CrawlDecision { Allow = true };
	if (pageToCrawl.Uri.Authority == "google.com")
		return new CrawlDecision { Allow = false, Reason = "Dont want to crawl google pages" };

	return decision;
};

crawler.ShouldDownloadPageContentDecisionMaker = (crawledPage, crawlContext) =>
{
	var decision = new CrawlDecision { Allow = true };
	if (!crawledPage.Uri.AbsoluteUri.Contains(".com"))
		return new CrawlDecision { Allow = false, Reason = "Only download raw page content for .com tlds" };

	return decision;
};

crawler.ShouldCrawlPageLinksDecisionMaker = (crawledPage, crawlContext) =>
{
	var decision = new CrawlDecision { Allow = true };
	if (crawledPage.Content.Bytes.Length < 100)
		return new CrawlDecision { Allow = false, Reason = "Just crawl links in pages that have at least 100 bytes" };

	return decision;
};

Custom Implementations

PoliteWebCrawler is the master orchestrator of the crawl. Its job is to coordinate all the utility classes to "crawl" a site. PoliteWebCrawler accepts an alternate implementation for each of its dependencies through its constructor.

var crawler = new PoliteWebCrawler(
	new CrawlConfiguration(),
	new YourCrawlDecisionMaker(),
	new YourThreadMgr(),
	new YourScheduler(),
	new YourPageRequester(),
	new YourHyperLinkParser(),
	new YourMemoryManager(),
	new YourDomainRateLimiter(),
	new YourRobotsDotTextFinder());

Passing null for any implementation will use the default. The example below will use your custom implementations of IPageRequester and IHyperLinkParser but the defaults for everything else.

var crawler = new PoliteWebCrawler(
	null,
	null,
	null,
	null,
	new YourPageRequester(),
	new YourHyperLinkParser(),
	null,
	null,
	null);

Below is an explanation of each interface that PoliteWebCrawler depends on to do the real work.

ICrawlDecisionMaker

The callback/delegate shortcuts are great for adding a small amount of logic, but if you are doing anything heavier you will want to pass in your own implementation of ICrawlDecisionMaker. The crawler calls this implementation to see whether a page should be crawled, whether the page's content should be downloaded, and whether the page's links should be crawled.

CrawlDecisionMaker.cs is the default ICrawlDecisionMaker used by Abot. This class takes care of common checks like making sure the config value MaxPagesToCrawl is not exceeded. Most users will only need to create a class that extends CrawlDecisionMaker and add their custom logic, as sketched after the interface below. However, you are completely free to create a class that implements ICrawlDecisionMaker from scratch and pass it into PoliteWebCrawler's constructor.

/// <summary>
/// Determines what pages should be crawled, whether the raw content should be downloaded and if the links on a page should be crawled
/// </summary>
public interface ICrawlDecisionMaker
{
	/// <summary>
	/// Decides whether the page should be crawled
	/// </summary>
	CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext);

	/// <summary>
	/// Decides whether the page's links should be crawled
	/// </summary>
	CrawlDecision ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext);

	/// <summary>
	/// Decides whether the page's content should be downloaded
	/// </summary>
	CrawlDecision ShouldDownloadPageContent(CrawledPage crawledPage, CrawlContext crawlContext);
}
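As a minimal sketch of the "extend the default" approach described above: the decision maker below keeps the built-in checks and layers one extra rule on top. It assumes the default CrawlDecisionMaker's methods can be overridden in your Abot version; if they cannot, implement ICrawlDecisionMaker directly instead.

public class DotComOnlyDecisionMaker : CrawlDecisionMaker
{
	public override CrawlDecision ShouldCrawlPage(PageToCrawl pageToCrawl, CrawlContext crawlContext)
	{
		//Keep the default checks (MaxPagesToCrawl, etc.)
		var baseDecision = base.ShouldCrawlPage(pageToCrawl, crawlContext);
		if (!baseDecision.Allow)
			return baseDecision;

		//Then apply the custom rule
		if (!pageToCrawl.Uri.Host.EndsWith(".com"))
			return new CrawlDecision { Allow = false, Reason = "Only crawl .com hosts" };

		return baseDecision;
	}
}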
IThreadManager

The IThreadManager interface deals with the multithreading details. It is used by the crawler to manage concurrent HTTP requests.

TaskThreadManager.cs is the default IThreadManager used by Abot.

/// <summary>
/// Handles the multithreading implementation details
/// </summary>
public interface IThreadManager : IDisposable
{
	/// <summary>
	/// Max number of threads to use.
	/// </summary>
	int MaxThreads { get; }

	/// <summary>
	/// Will perform the action asynchronously on a separate thread
	/// </summary>
	/// <param name="action">The action to perform</param>
	void DoWork(Action action);

	/// <summary>
	/// Whether there are running threads
	/// </summary>
	bool HasRunningThreads();

	/// <summary>
	/// Abort all running threads
	/// </summary>
	void AbortAll();
}
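To make the contract concrete, here is an illustrative Task-based implementation of the interface above. It is a sketch only: the real TaskThreadManager also throttles work to MaxThreads and handles cancellation and error propagation, none of which is shown here.

using System;
using System.Threading;
using System.Threading.Tasks;

public class SimpleTaskThreadManager : IThreadManager
{
	private int _running;

	public SimpleTaskThreadManager(int maxThreads) => MaxThreads = maxThreads;

	public int MaxThreads { get; }

	public void DoWork(Action action)
	{
		Interlocked.Increment(ref _running);
		Task.Run(() =>
		{
			try { action(); }
			finally { Interlocked.Decrement(ref _running); }
		});
	}

	public bool HasRunningThreads() => Volatile.Read(ref _running) > 0;

	public void AbortAll()
	{
		//A real implementation would signal a CancellationToken here
	}

	public void Dispose() => AbortAll();
}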
IScheduler

The IScheduler interface deals with managing what pages need to be crawled. The crawler gives the links it finds to, and gets the pages to crawl from, the IScheduler implementation. A common use case for writing your own implementation might be to distribute crawls across multiple machines, managed by a distributed scheduler.

Scheduler.cs is the default IScheduler used by the crawler. By default it is constructed using in-memory collections to determine which pages have been crawled and which still need to be crawled.

/// <summary>
/// Handles managing the priority of what pages need to be crawled
/// </summary>
public interface IScheduler
{
	/// <summary>
	/// Count of remaining items that are currently scheduled
	/// </summary>
	int Count { get; }

	/// <summary>
	/// Schedules the param to be crawled
	/// </summary>
	void Add(PageToCrawl page);

	/// <summary>
	/// Schedules the param to be crawled
	/// </summary>
	void Add(IEnumerable<PageToCrawl> pages);

	/// <summary>
	/// Gets the next page to crawl
	/// </summary>
	PageToCrawl GetNext();

	/// <summary>
	/// Clear all currently scheduled pages
	/// </summary>
	void Clear();
}
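A minimal in-memory implementation of the members shown above might look like the sketch below: a FIFO queue plus a seen-set for de-duplication. The real Scheduler.cs does more (it honors config options and pluggable crawled-url repositories), and the interface in your Abot version may have additional members.

using System.Collections.Concurrent;
using System.Collections.Generic;

public class SimpleScheduler : IScheduler
{
	private readonly ConcurrentQueue<PageToCrawl> _queue = new ConcurrentQueue<PageToCrawl>();
	private readonly ConcurrentDictionary<string, byte> _seen = new ConcurrentDictionary<string, byte>();

	public int Count => _queue.Count;

	public void Add(PageToCrawl page)
	{
		//Only schedule urls that have not been seen before
		if (_seen.TryAdd(page.Uri.AbsoluteUri, 0))
			_queue.Enqueue(page);
	}

	public void Add(IEnumerable<PageToCrawl> pages)
	{
		foreach (var page in pages)
			Add(page);
	}

	public PageToCrawl GetNext()
	{
		return _queue.TryDequeue(out var page) ? page : null;
	}

	public void Clear()
	{
		while (_queue.TryDequeue(out _)) { }
	}
}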
IPageRequester

The IPageRequester interface deals with making the raw HTTP requests.

PageRequester.cs is the default IPageRequester used by the crawler.

public interface IPageRequester : IDisposable
{
	/// <summary>
	/// Make an http web request to the url and download its content
	/// </summary>
	Task<CrawledPage> MakeRequestAsync(Uri uri);

	/// <summary>
	/// Make an http web request to the url and download its content based on the param func decision
	/// </summary>
	Task<CrawledPage> MakeRequestAsync(Uri uri, Func<CrawledPage, CrawlDecision> shouldDownloadContent);
}
IHyperLinkParser

The IHyperLinkParser interface deals with parsing links out of raw HTML.

AngleSharpHyperlinkParser.cs is the default IHyperLinkParser used by the crawler. It uses the well-known AngleSharp library to do the HTML parsing. AngleSharp uses CSS-style selectors like jQuery, but all in C#.

/// <summary>
/// Handles parsing hyperlinks out of the raw html
/// </summary>
public interface IHyperLinkParser
{
	/// <summary>
	/// Parses html to extract hyperlinks, converts each into an absolute url
	/// </summary>
	IEnumerable<Uri> GetLinks(CrawledPage crawledPage);
}
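For illustration, a parser that only extracts anchor hrefs with AngleSharp and resolves them against the page URL could look like the sketch below. The bundled AngleSharpHyperlinkParser additionally handles base tags, nofollow semantics, and other edge cases, so treat this as a starting point rather than a replacement.

using System;
using System.Collections.Generic;
using AngleSharp.Html.Parser;

public class AnchorOnlyLinkParser : IHyperLinkParser
{
	public IEnumerable<Uri> GetLinks(CrawledPage crawledPage)
	{
		var links = new List<Uri>();
		var document = new HtmlParser().ParseDocument(crawledPage.Content.Text);

		foreach (var anchor in document.QuerySelectorAll("a[href]"))
		{
			//Convert relative hrefs into absolute urls against the page uri
			var href = anchor.GetAttribute("href");
			if (Uri.TryCreate(crawledPage.Uri, href, out var absolute))
				links.Add(absolute);
		}
		return links;
	}
}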
IMemoryManager

The IMemoryManager handles memory monitoring. This feature is still experimental and may be removed in a future release if found to be unreliable.

MemoryManager.cs is the default implementation used by the crawler.

/// <summary>
/// Handles memory monitoring/usage
/// </summary>
public interface IMemoryManager : IMemoryMonitor, IDisposable
{
	/// <summary>
	/// Whether the current process that is hosting this instance is allocated/using above the param value of memory in mb
	/// </summary>
	bool IsCurrentUsageAbove(int sizeInMb);

	/// <summary>
	/// Whether there is at least the param value of available memory in mb
	/// </summary>
	bool IsSpaceAvailable(int sizeInMb);
}
IDomainRateLimiter

The IDomainRateLimiter handles domain rate limiting. It determines how much time must elapse before another HTTP request may be made to a given domain.

DomainRateLimiter.cs is the default implementation used by the crawler.

/// <summary>
/// Rate limits or throttles on a per domain basis
/// </summary>
public interface IDomainRateLimiter
{
	/// <summary>
	/// If the domain of the param has been flagged for rate limiting, it will be rate limited according to the configured minimum crawl delay
	/// </summary>
	void RateLimit(Uri uri);

	/// <summary>
	/// Add a domain entry so that domain may be rate limited according to the param minimum crawl delay
	/// </summary>
	void AddDomain(Uri uri, long minCrawlDelayInMillisecs);
}
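For example, if one domain needs a stricter delay than the rest of the crawl, you might register it explicitly and pass the limiter into the crawler. This sketch assumes DomainRateLimiter's constructor takes the default minimum delay in milliseconds and uses the constructor parameter order shown earlier; verify both against your Abot version.

var rateLimiter = new DomainRateLimiter(1000); //default: at most one request per second per domain
rateLimiter.AddDomain(new Uri("http://YOURSITEHERE.com"), 5000); //this domain: at most one request every 5 seconds

var crawler = new PoliteWebCrawler(
	null, null, null, null, null, null, null,
	rateLimiter, //swap in the custom rate limiter
	null);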
IRobotsDotTextFinder

The IRobotsDotTextFinder is responsible for retrieving the robots.txt file for every domain (if IsRespectRobotsDotTextEnabled is true) and building the robots.txt abstraction, which implements the IRobotsDotText interface.

RobotsDotTextFinder.cs is the default implementation used by the crawler.

/// <summary>
/// Finds and builds the robots.txt file abstraction
/// </summary>
public interface IRobotsDotTextFinder
{
	/// <summary>
	/// Finds the robots.txt file using the root uri
	/// </summary>
	IRobotsDotText Find(Uri rootUri);
}

Download the Source

Clone the project via the command line:

git clone https://github.com/sjdirect/abot.git
