Scrapling

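Dealing with web scrapers that fail because of anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For beginners and experts alike, Scrapling provides powerful features while staying simple.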

>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
>> StealthyFetcher.auto_match = True
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True)  # and Scrapling still finds them!

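Key Features

Fetch websites as you prefer, with async support

HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class.
Dynamic Loading & Automation: Fetch dynamic websites with the PlayWrightFetcher class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or NSTbrowser's browserless!
Anti-bot Protections Bypass: Easily bypass protections with the StealthyFetcher and PlayWrightFetcher classes.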

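Adaptive Scraping

Smart Element Tracking: Relocate elements after website changes using an intelligent similarity system and integrated storage.
Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
Find Similar Elements: Automatically locate elements similar to the element you found!
Smart Content Scraping: Extract data from multiple websites without specific selectors using Scrapling's powerful features.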
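
To make the idea of relocating an element by similarity more concrete, here is a minimal sketch in plain Python. It is not Scrapling's actual algorithm or storage format; the fingerprint fields, the equal weighting, and the helper names (fingerprint, similarity, relocate) are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fingerprint(tag, attrs, text, path):
    """Collect the properties to remember about an element (hypothetical format)."""
    return {"tag": tag, "attrs": dict(attrs), "text": text, "path": tuple(path)}


def similarity(saved, candidate):
    """Score two fingerprints between 0.0 and 1.0 by averaging per-property ratios."""
    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()

    scores = [
        1.0 if saved["tag"] == candidate["tag"] else 0.0,
        ratio(saved["text"], candidate["text"]),
        ratio(saved["path"], candidate["path"]),
        ratio(sorted(saved["attrs"].items()), sorted(candidate["attrs"].items())),
    ]
    return sum(scores) / len(scores)


def relocate(saved, candidates):
    """Return the candidate that looks most like the saved element."""
    return max(candidates, key=lambda c: similarity(saved, c))


# Element as it looked when it was first scraped and saved
saved = fingerprint("div", {"class": "product", "id": "p1"}, "Blue Widget",
                    ["html", "body", "div", "div"])

# Elements found on the redesigned page: classes and nesting have changed
candidates = [
    fingerprint("div", {"class": "banner"}, "Summer sale", ["html", "body", "div"]),
    fingerprint("div", {"class": "item product-card", "id": "p1"}, "Blue Widget",
                ["html", "body", "main", "section", "div"]),
    fingerprint("span", {"class": "price"}, "$10", ["html", "body", "main", "span"]),
]

best = relocate(saved, candidates)
print(best["attrs"])
```

Even though the original '.product' class no longer matches exactly, the candidate that shares the tag, text, id, and most of the DOM path scores highest. A real implementation would likely weight properties differently and persist the fingerprint between runs.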

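High Performance

Lightning Fast: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
Memory Efficient: Optimized data structures for a minimal memory footprint.
⚡ Fast JSON Serialization: 10x faster than the standard library.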

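Developer Friendly

Powerful Navigation API: Easy DOM traversal in all directions.
Rich Text Processing: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries with added methods that consume less memory than standard dictionaries.
Automatic Selector Generation: Generate robust short and full CSS/XPath selectors for any element.
Familiar API: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy.
Type hints: Complete type/doc-string coverage for future-proofing and the best autocompletion support.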
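
As an illustration of what "short" and "full" generated selectors can mean, the sketch below builds both forms for an element using only the standard library. This is an assumption-laden toy, not Scrapling's generator: the css_selector helper, its full flag, and the rule "stop at the nearest ancestor with an id" are hypothetical, and the markup is well-formed so that xml.etree can parse it.

```python
import xml.etree.ElementTree as ET

MARKUP = """
<html><body>
  <div id="catalog">
    <p>Intro</p>
    <p class="product">Blue Widget</p>
  </div>
</body></html>
"""


def css_selector(root, target, full=False):
    """Build a CSS selector for `target` by walking up to the root.

    With full=False the walk stops at the nearest element that has an id.
    """
    # ElementTree has no parent pointers, so build a child -> parent map first
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        if not full and node.get("id"):
            steps.append("#" + node.get("id"))
            break
        parent = parents.get(node)
        step = node.tag
        if parent is not None:
            # Disambiguate siblings that share the same tag name
            same_tag = [sib for sib in parent if sib.tag == node.tag]
            if len(same_tag) > 1:
                step += f":nth-of-type({same_tag.index(node) + 1})"
        steps.append(step)
        node = parent
    return " > ".join(reversed(steps))


root = ET.fromstring(MARKUP)
product = list(root.iter("p"))[1]

short_selector = css_selector(root, product)
full_selector = css_selector(root, product, full=True)
print(short_selector)
print(full_selector)
```

The short form is easier to read but depends on the id staying stable, while the full form depends on the whole ancestor chain, which is the trade-off a selector generator has to balance.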

Getting Started

from scrapling.fetchers import Fetcher

# Do HTTP GET request to a web page and create an Adaptor instance
page = Fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all text content from all HTML tags in the page except the `script` and `style` tags
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes elements; any of these methods will return a list of strings directly (TextHandlers)
quotes = page.css('.quote .text::text')  # CSS selector
quotes = page.xpath('//span[@class="text"]/text()')  # XPath
quotes = page.css('.quote').css('.text::text')  # Chained selectors
quotes = [element.text for element in page.css('.quote .text')]  # Slower than bulk query above

# Get the first quote element
quote = page.css_first('.quote')  # same as page.css('.quote').first or page.css('.quote')[0]

# Tired of selectors? Use find_all/find
# Get all 'div' HTML tags that one of its 'class' values is 'quote'
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Working with elements
quote.html_content  # Get the Inner HTML of this element
quote.prettify()  # Prettified version of Inner HTML above
quote.attrib  # Get that element's attributes
quote.path  # DOM path to element (List of all ancestors from <html> tag till the element itself)

To keep things simple, all methods can be chained on top of each other!

Note

Check out the full documentation for more details.

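Parsing Performance

Scrapling isn't just powerful; it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of a second, all while focusing exclusively on parsing HTML documents. Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

Text Extraction Speed Test (5000 nested elements)

This test consists of extracting the text content of 5000 nested div elements.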

#	Library	Time (ms)	vs Scrapling
1	Scrapling	5.44	1.0x
2	Parsel/Scrapy	5.53	1.017x
3	Raw lxml	6.76	1.243x
4	PyQuery	21.96	4.037x
5	Selectolax	67.12	12.338x
6	BS4 with lxml	1307.03	240.263x
7	MechanicalSoup	1322.64	243.132x
8	BS4 with html5lib	3373.75	620.175x

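As you can see, Scrapling is on par with Scrapy and slightly faster than lxml, the library both of them are built on top of. These are the closest results to Scrapling. PyQuery is also built on top of lxml, but Scrapling is four times faster.

Extraction By Text Speed Test

Scrapling can find elements based on their text content and find elements similar to those elements. The only other known library with these two features is AutoScraper.

So we compared the two to see how fast Scrapling is at these two tasks relative to AutoScraper.

Here are the results: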

Library	Time (ms)	vs Scrapling
Scrapling	2.51	1.0x
AutoScraper	11.41	4.546x

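Scrapling can find elements with more methods, and it returns full element Adaptor objects rather than only the text, as AutoScraper does. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them.

As you can see, Scrapling is still 4.5 times faster at the same task.

If Scrapling only extracted the elements, without stopping to extract each element's text, it would be faster still, but as noted, the extra step is there to keep the comparison fair.

All benchmark results are averages of 100 runs. See our benchmarks for the methodology and to run your own comparisons.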
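
For readers who want to sanity-check the "vs Scrapling" column or time their own parsing code, the following sketch shows one way to average a callable over 100 runs and compute the relative factor. It is not the project's benchmark script; average_ms and relative_to are hypothetical helper names, and the workload passed to average_ms is a stand-in.

```python
import timeit


def average_ms(func, runs=100):
    """Average wall-clock time of one call to `func`, in milliseconds."""
    total_seconds = timeit.timeit(func, number=runs)
    return total_seconds / runs * 1000


def relative_to(baseline_ms, other_ms):
    """How many times slower `other_ms` is than the baseline, to 3 decimals."""
    return round(other_ms / baseline_ms, 3)


# Timings copied from the two tables above (milliseconds)
text_extraction = {"Scrapling": 5.44, "Parsel/Scrapy": 5.53, "Raw lxml": 6.76, "PyQuery": 21.96}
by_text = {"Scrapling": 2.51, "AutoScraper": 11.41}

for table in (text_extraction, by_text):
    baseline = table["Scrapling"]
    for name, ms in table.items():
        print(f"{name}: {ms} ms, {relative_to(baseline, ms)}x")

# Timing a stand-in workload; replace the lambda with your own parsing call
sample_ms = average_ms(lambda: sum(range(1000)))
print(f"stand-in workload: {sample_ms:.4f} ms per run")
```

Dividing each time by Scrapling's time and rounding to three decimals reproduces the factors listed in the tables, for example 5.53 / 5.44 gives 1.017.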

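Installation

Scrapling is a breeze to get started with. Starting from version 0.2.9, it requires at least Python 3.9.

pip3 install scrapling

Then run this command to install the browser dependencies needed to use the Fetcher classes:

scrapling install

If you run into any installation issues, please open an issue.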

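Contributing

Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

Please read the contributing file before doing anything.

Disclaimer for Scrapling Project

Caution

This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or are within their allowed rules, such as the robots.txt file.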

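License

This work is licensed under BSD-3.

Acknowledgments

This project includes code adapted from:

Parsel (BSD License) - used for the translator submodule

Thanks and References

Daijro's brilliant work on BrowserForge and Camoufox
Vinyzu's work on Playwright's mock on Botright
brotector
fakebrowser
rebrowser-patches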

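Known Issues

In the auto-matching save process, only the unique properties of the first element in the selection results get saved. So if the selector you are using selects different elements in different locations on the page, auto-matching will probably return only the first element when you relocate it later. This does not apply to combined CSS selectors (for example, using commas to combine more than one selector), because these selectors get separated and each one is executed alone.

Designed & crafted with ❤️ by Karim Shoair.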
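
To picture what "separated and executed alone" means for a combined selector, here is a small standalone sketch that splits a selector group on its top-level commas. It is only an illustration and not how Scrapling implements this; split_selector_group is a hypothetical helper, and it assumes commas inside parentheses, brackets, or quotes belong to a single selector.

```python
def split_selector_group(selector):
    """Split a comma-combined CSS selector into its individual selectors."""
    parts, current = [], []
    depth = 0      # nesting level of () and []
    quote = None   # the quote character we are currently inside, if any
    for ch in selector:
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch == "," and depth == 0:
            # A top-level comma ends the current selector
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    parts.append("".join(current).strip())
    return [part for part in parts if part]


combined = ".product, .item:is(h2, h3), a[title='x,y']"
selectors = split_selector_group(combined)
for single in selectors:
    print(single)
```

Each resulting selector could then be run and saved on its own, which is why the first-element limitation described above applies per selector rather than to the combined group.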
