Practice URL:
1. Getting text values
xpath
In [18]: response.selector.xpath('//title/text()').extract_first(default='')
Out[18]: 'Example website'
css
In [19]: response.selector.css('title::text').extract_first(default='')
Out[19]: 'Example website'
Note: this can be shortened to response.xpath().
2. Getting attribute values
xpath
In [23]: response.selector.xpath('//base/@href').extract_first()
Out[23]: 'http://example.com/'
css
In [24]: response.selector.css('base::attr(href)').extract_first()
Out[24]: 'http://example.com/'
Note: this can be shortened to response.css().
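3. Chaining xpath and css
Because css() and xpath() both return a SelectorList instance, the two can be conveniently chained.
PS: when getting an attribute with xpath, @attr already selects the value, so /text() is not needed.
In [21]: response.selector.css('img').xpath('@src').extract()
Out[21]:
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']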
4. .re()
.re()
.re_first()
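PS: .re() returns a list of unicode strings rather than a SelectorList, so .re() calls cannot be chained. .re_first() returns only the first match.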
In [1]: response.selector.css('div > p:nth-of-type(2)::text').extract()
Out[1]: ['333xxx']

In [2]: response.selector.css('div > p:nth-of-type(2)::text').extract_first()
Out[2]: '333xxx'

In [3]: response.selector.css('div > p:nth-of-type(2)::text').re_first(r'\w+')
Out[3]: '333xxx'

In [4]: response.selector.css('div > p:nth-of-type(2)::text').re_first(r'[A-Za-z]+')
Out[4]: 'xxx'

In [5]: response.selector.css('div > p:nth-of-type(2)::text').re(r'[A-Za-z]+')
Out[5]: ['xxx']
5. A note on relative XPath lookups
Find the p tags under the div tag:
<p>11</p>
<div>
    <p>222</p>
    <p>333</p>
</div>
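Wrong approach (//p is an absolute path, so it searches from the document root and matches every p on the page, not only those under the div):
In [4]: divs = response.selector.xpath('//div')
In [5]: for p in divs.xpath('//p'):
   ...:     print(p.extract())
   ...:
<p>11</p>
<p>222</p>
<p>333</p>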
Correct approach 1 (.//p is relative to each div and matches all p descendants):
In [6]: divs = response.selector.css('div')
In [7]: for p in divs.xpath('.//p'):
   ...:     print(p.extract())
   ...:
<p>222</p>
<p>333</p>
Correct approach 2 (p matches only the direct p children of each div):
In [8]: divs = response.selector.css('div')
In [9]: for p in divs.xpath('p'):
   ...:     print(p.extract())
   ...:
<p>222</p>
<p>333</p>