python-zyte-api¶
Python client libraries for Zyte API.
This package provides a command-line utility and an asyncio-based library.
The license is BSD 3-clause.
Installation¶
pip install zyte-api
zyte-api requires Python 3.7+.
API key¶
Make sure you have an API key for the Zyte API service.
You can set the ZYTE_API_KEY environment variable with the key to avoid passing it around explicitly.
Command-line interface¶
The most basic way to use the client is from the command line.
First, create a file with URLs, one URL per line (e.g. urls.txt).
Second, set the ZYTE_API_KEY environment variable to your API key (you can also pass the API key as the --api-key script argument).
Then run the script to get the results:
zyte-api urls.txt --output res.jsonl
Note
You may use python -m zyte_api instead of zyte-api.
Requests to get browser HTML from those input URLs will be sent to Zyte API, using up to 20 parallel connections, and the API responses will be stored in the res.jsonl JSON Lines file, one response per line.
The results may be stored in an order that differs from the input order.
If you need to match the output results to the input URLs, the best way is to use the echoData field (see below); it is passed through and returned as-is in the echoData attribute. By default it contains the input URL the content belongs to.
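Matching results back to inputs through echoData can be sketched as follows. This is a minimal illustration, not part of the package: the helper name index_by_echo_data is hypothetical, the sample lines stand in for the contents of res.jsonl, and it assumes that echoData is a string (the default input URL) and that each response carries a browserHtml key.

```python
import json


def index_by_echo_data(lines):
    """Map each response's echoData value to the parsed response.

    Hypothetical helper; assumes echoData is a string (hashable).
    """
    by_key = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        response = json.loads(line)
        by_key[response.get("echoData")] = response
    return by_key


# Illustrative sample standing in for res.jsonl; note the shuffled order.
sample = [
    '{"url": "https://example.com/b", "echoData": "https://example.com/b", "browserHtml": "<html>b</html>"}',
    '{"url": "https://example.com/a", "echoData": "https://example.com/a", "browserHtml": "<html>a</html>"}',
]

results = index_by_echo_data(sample)

# Restore the input order by looking up each input URL.
input_urls = ["https://example.com/a", "https://example.com/b"]
ordered = [results[url] for url in input_urls]
```

With a real output file, the same helper can be fed an open file object, e.g. index_by_echo_data(open("res.jsonl")), since iterating a file yields its lines.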
If you need more flexibility, you can customize the requests by creating a JSON Lines file with queries: one JSON object per line. You can pass any Zyte API options there. For example, you could create the following requests.jsonl file:
{"url": "https://example.com", "browserHtml": true, "geolocation": "GB", "echoData": "homepage"}
{"url": "https://example.com/foo", "browserHtml": true, "javascript": false}
{"url": "https://example.com/bar", "browserHtml": true, "geolocation": "US"}
See API docs for a description of all supported parameters.
To get results for this requests.jsonl file, run:
zyte-api requests.jsonl --output res.jsonl
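A query file like requests.jsonl can also be generated from a script rather than written by hand. The sketch below uses only the standard library; the helper names build_queries and to_jsonl are hypothetical, and the fields used (url, browserHtml, echoData, geolocation) are the ones shown in the example file above.

```python
import json


def build_queries(urls, **options):
    """Build one query dict per URL; extra options are applied to every query."""
    return [
        {"url": url, "browserHtml": True, "echoData": url, **options}
        for url in urls
    ]


def to_jsonl(queries):
    """Serialize queries as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(query) + "\n" for query in queries)


queries = build_queries(
    ["https://example.com", "https://example.com/foo"],
    geolocation="GB",
)
text = to_jsonl(queries)
```

Writing text to a file named requests.jsonl gives an input that can be passed to the command shown above.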
Processing speed¶
Each API key has a limit on requests per second (RPS). To get your URLs processed faster, you can increase the number of concurrent connections.
The best options depend on the RPS limit and on the websites you’re extracting data from. For example, if your API key has a limit of 3 RPS and the average response time you observe for your websites is 10 s, then to reach 3 RPS you may set the number of concurrent connections to 30 (3 requests per second × 10 seconds per request).
To set this option in the CLI, use the --n-conn argument:
zyte-api urls.txt --n-conn 30 --output res.jsonl
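The estimate above (RPS limit multiplied by the average response time) can be written as a small helper. This is only a rule-of-thumb sketch; the function name estimate_connections is hypothetical, and the result is a starting point to tune rather than a guaranteed optimum.

```python
import math


def estimate_connections(rps_limit, avg_response_seconds):
    """Estimate how many concurrent connections are needed to reach rps_limit.

    Each connection completes roughly 1 / avg_response_seconds requests
    per second, so the estimate is rps_limit * avg_response_seconds,
    rounded up, and never less than one connection.
    """
    return max(1, math.ceil(rps_limit * avg_response_seconds))


# The example from the text: 3 RPS with a 10 s average response time.
n_conn = estimate_connections(3, 10)
```

The resulting value can then be passed as --n-conn.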
If too many requests are processed in parallel, you’ll get throttling errors. The CLI handles them automatically, but they make extraction less efficient; please tune the concurrency options so that you do not hit throttling errors (HTTP 429) often.
You may also be limited by the website speed. Zyte API tries not to hit any individual website too hard, but it could be better to limit this on the client side as well. If you’re extracting data from a single website, it could make sense to decrease the number of parallel requests; this can ensure a higher overall success ratio.
If you’re extracting data from multiple websites, it makes sense to spread the load across time: if you have websites A, B and C, don’t send requests in AAAABBBBCCCC order, send them in ABCABCABCABC order instead.
To do so, you can change the order of the queries in your input file.
Alternatively, you can pass the --shuffle option; it randomly shuffles input queries before sending them to the API:
zyte-api urls.txt --shuffle --output res.jsonl
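If a deterministic ABCABC order is preferred over random shuffling, the input can be reordered before it is written to the file. The sketch below groups URLs by host and interleaves the groups round-robin; the function name interleave_by_host is hypothetical and this is not something the CLI does for you.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs so that consecutive requests target different hosts."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)

    missing = object()  # sentinel used to pad the shorter groups
    ordered = []
    # Take the 1st URL of every host, then the 2nd of every host, and so on.
    for batch in zip_longest(*groups.values(), fillvalue=missing):
        ordered.extend(url for url in batch if url is not missing)
    return ordered


urls = [
    "https://a.example/1", "https://a.example/2",
    "https://b.example/1", "https://b.example/2",
    "https://c.example/1",
]
spread = interleave_by_host(urls)
```

Hosts keep the order in which they first appear in the input, and URLs within a host keep their relative order.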
Run zyte-api --help to get a description of all supported options.
asyncio API¶
Create an instance of AsyncClient to use the asyncio client API. You can use the request_raw method to perform individual requests:
import asyncio

from zyte_api.aio.client import AsyncClient

client = AsyncClient()


async def single_request(url):
    return await client.request_raw({
        'url': url,
        'browserHtml': True
    })

response = asyncio.run(single_request("https://books.toscrape.com"))
# Do something with the response ..
There is also a request_parallel_as_completed method, which allows you to process many URLs in parallel, using multiple connections:
import asyncio
import json
import sys

from zyte_api.aio.client import AsyncClient, create_session
from zyte_api.aio.errors import RequestError


async def extract_from(urls, n_conn):
    client = AsyncClient(n_conn=n_conn)
    requests = [
        {"url": url, "browserHtml": True}
        for url in urls
    ]
    async with create_session(n_conn) as session:
        res_iter = client.request_parallel_as_completed(requests, session=session)
        for fut in res_iter:
            try:
                res = await fut
                # do something with a result, e.g.
                print(json.dumps(res))
            except RequestError as e:
                print(e, file=sys.stderr)
                raise

urls = ["https://toscrape.com", "https://books.toscrape.com"]
asyncio.run(extract_from(urls, n_conn=15))
request_parallel_as_completed is modelled after asyncio.as_completed (see https://docs.python.org/3/library/asyncio-task.html#asyncio.as_completed), and actually uses it under the hood.
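For readers unfamiliar with asyncio.as_completed, the standard-library sketch below shows the behaviour being modelled: awaitables are yielded in the order they finish, not the order they were submitted. The fake_request coroutine is a hypothetical stand-in that sleeps instead of calling Zyte API.

```python
import asyncio


async def fake_request(name, delay):
    """Hypothetical stand-in for an API request: wait, then return a label."""
    await asyncio.sleep(delay)
    return name


async def collect_in_completion_order():
    coros = [
        fake_request("slow", 0.2),
        fake_request("fast", 0.01),
        fake_request("medium", 0.1),
    ]
    finished = []
    # as_completed yields awaitables in the order in which they finish.
    for fut in asyncio.as_completed(coros):
        finished.append(await fut)
    return finished


order = asyncio.run(collect_in_completion_order())
```

This is why the output order of parallel requests generally differs from the input order.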
The request_parallel_as_completed and request_raw methods handle throttling (HTTP 429 errors) and network errors, retrying a request in these cases.
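To illustrate the general idea of retrying on throttling and network errors, here is a generic retry-with-exponential-backoff sketch. It is not the client's actual retry policy: ThrottledError, retry_with_backoff, flaky_call, and the delay values are all hypothetical and exist only for this example.

```python
import asyncio


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 (throttling) response."""


async def retry_with_backoff(make_call, max_attempts=5, base_delay=0.01):
    """Await make_call(), retrying on throttling or network errors.

    The delay doubles after every failed attempt; the last error is
    re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except (ThrottledError, ConnectionError):
            if attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


async def flaky_call():
    """Fail twice with a throttling error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottledError("429 Too Many Requests")
    return {"status": "ok"}


result = asyncio.run(retry_with_backoff(flaky_call))
```

Because the client methods already retry these errors, a wrapper like this is not needed around request_raw; the sketch only shows what such handling involves.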
The CLI implementation (zyte_api/__main__.py) can serve as a usage example.
API Reference¶
Python client libraries and command line utilities for Zyte API
Contributing¶
python-zyte-api is an open-source project. Your contribution is very welcome!
Issue Tracker¶
If you have a bug report, a new feature proposal, or simply would like to ask a question, please check our issue tracker on GitHub: https://github.com/zytedata/python-zyte-api/issues
Source code¶
Our source code is hosted on GitHub: https://github.com/zytedata/python-zyte-api
Before opening a pull request, it is worth checking current and previous issues. Some code changes may also require discussion before being accepted, so consider opening a new issue before implementing large or breaking changes.
Testing¶
We use tox to run tests with different Python versions:
tox
The command above also runs type checks; we use mypy.
Changes¶
0.4.5 (2023-01-03)¶
w3lib >= 2.1.1 is required in install_requires, to ensure that URLs are escaped properly.
The unnecessary requests library is removed from install_requires.
Fixed tox 4 support.
0.4.4 (2022-12-01)¶
Fixed an issue with submitting URLs which contain unescaped symbols
New “retrying” argument for AsyncClient.__init__, which allows setting a custom retrying policy for the client
--dont-retry-errors argument in the CLI tool
0.4.3 (2022-11-10)¶
Connections are no longer reused between requests. This reduces the number of ServerDisconnectedError exceptions.
0.4.2 (2022-10-28)¶
Bump minimum aiohttp version to 3.8.0, as earlier versions don’t support brotli decompression of responses
Declared Python 3.11 support
0.4.1 (2022-10-16)¶
Network errors, like server timeouts or disconnections, are now retried for up to 15 minutes, instead of 5 minutes.
0.4.0 (2022-09-20)¶
Brotli is now a required dependency. As a result, requests are sent with Accept-Encoding: br, and brotli responses are decompressed automatically.
0.3.0 (2022-07-29)¶
Internal AggStats class is cleaned up:
AggStats.n_extracted_queries attribute is removed, as it was a duplicate of AggStats.n_results
AggStats.n_results is renamed to AggStats.n_success
AggStats.n_input_queries is removed as redundant and misleading; AggStats got a new AggStats.n_processed property instead.
This change is backwards incompatible if you used stats directly.
0.2.1 (2022-07-29)¶
aiohttp.client_exceptions.ClientConnectorError is now treated as a network error and retried accordingly.
Removed the unused zyte_api.sync module.
0.2.0 (2022-07-14)¶
Temporary download errors are now retried 3 times by default. They were not retried in previous releases.
0.1.4 (2022-05-21)¶
This release contains usability improvements to the command-line script:
Instead of python -m zyte_api you can now run it as zyte-api;
the type of the input file (--intype argument) is now guessed based on file extension and content; .jl, .jsonl and .txt files are supported.
0.1.3 (2022-02-03)¶
Minor documentation fix
Remove support for Python 3.6
Added support for Python 3.10
0.1.2 (2021-11-10)¶
Default timeouts changed
0.1.1 (2021-11-01)¶
CHANGES.rst updated properly
0.1.0 (2021-11-01)¶
Initial release.
License¶
Copyright (c) Zyte Group Ltd All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Zyte nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.