python-zyte-api

Python client libraries for Zyte API.

This package provides a command-line utility and an asyncio-based library.

License is BSD 3-clause.

Installation

pip install zyte-api

zyte-api requires Python 3.7+.

API key

Make sure you have an API key for the Zyte API service. You can set the ZYTE_API_KEY environment variable to your key to avoid passing it around explicitly.
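As a minimal sketch of the "explicit key or environment variable" idea, the helper below prefers a key passed by the caller and falls back to ZYTE_API_KEY. The name resolve_api_key and the precedence order are assumptions made for illustration; this is not code from zyte-api, which performs its own lookup.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Return explicit_key if given, else the ZYTE_API_KEY value from env.

    Hypothetical helper for illustration; not part of the zyte-api package.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("ZYTE_API_KEY")
    if not key:
        raise RuntimeError("No API key: pass one explicitly or set ZYTE_API_KEY")
    return key
```

Failing early with a clear message when neither source provides a key tends to be easier to debug than an authentication error returned by the first request.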

Command-line interface

The most basic way to use the client is from a command line.

First, create a file with URLs, one URL per line (e.g. urls.txt).

Second, set the ZYTE_API_KEY environment variable to your API key (you can also pass the API key with the --api-key argument).

Then run the script to get the results:

zyte-api urls.txt --output res.jsonl

Note

You may use python -m zyte_api instead of zyte-api.

Requests to get browser HTML from those input URLs will be sent to Zyte API, using up to 20 parallel connections, and the API responses will be stored in the res.jsonl JSON Lines file, 1 response per line.

The results may be stored in an order different from the input order. To match output results to input URLs, use the echoData field (see below): it is passed through and returned as-is in the echoData attribute of the response. By default, it contains the input URL the content belongs to.
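One way to do this matching, sketched under the assumption that every line of the output file is a JSON object with an echoData value usable as a dictionary key (for example a string), is to build an index keyed by echoData. The function name index_by_echo_data is hypothetical.

```python
import json


def index_by_echo_data(lines):
    """Map each response's echoData value to the parsed response object.

    Assumes echoData is hashable (e.g. a string such as the input URL).
    """
    by_key = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        response = json.loads(line)
        by_key[response.get("echoData")] = response
    return by_key


# Usage with the output file from the command above:
# with open("res.jsonl", encoding="utf-8") as f:
#     results = index_by_echo_data(f)
```

If several requests share the same echoData value, later lines overwrite earlier ones in this sketch, so distinct values are needed when every response matters.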

If you need more flexibility, you can customize the requests by creating a JSON Lines file with queries: a JSON object per line. You can pass any Zyte API options there. For example, you could create the following requests.jsonl file:

{"url": "https://example.com", "browserHtml": true, "geolocation": "GB", "echoData": "homepage"}
{"url": "https://example.com/foo", "browserHtml": true, "javascript": false}
{"url": "https://example.com/bar", "browserHtml": true, "geolocation": "US"}

See API docs for a description of all supported parameters.

To get results for this requests.jsonl file, run:

zyte-api requests.jsonl --output res.jsonl
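If the queries are generated by a program rather than written by hand, a file like requests.jsonl can be produced with the standard json module. This is a small sketch; the helper name to_json_lines is hypothetical, and the query fields are the ones from the example file above.

```python
import json


def to_json_lines(queries):
    """Serialize an iterable of query dicts as JSON Lines: one object per line."""
    return "".join(json.dumps(query) + "\n" for query in queries)


queries = [
    {"url": "https://example.com", "browserHtml": True,
     "geolocation": "GB", "echoData": "homepage"},
    {"url": "https://example.com/foo", "browserHtml": True, "javascript": False},
]
text = to_json_lines(queries)

# To save the queries for the CLI:
# with open("requests.jsonl", "w", encoding="utf-8") as f:
#     f.write(text)
```

json.dumps never emits raw newlines inside an object, so each query stays on a single line as the JSON Lines format requires.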

Processing speed

Each API key has a limit on requests per second (RPS). To get your URLs processed faster, you can increase the number of concurrent connections.

The best settings depend on the RPS limit and on the websites you’re extracting data from. For example, if your API key has a limit of 3 RPS and the average response time you observe for your websites is 10 s, then reaching 3 RPS takes about 3 × 10 = 30 concurrent connections.

To set the number of concurrent connections in the CLI, use the --n-conn argument:

zyte-api urls.txt --n-conn 30 --output res.jsonl
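The arithmetic behind the 30-connection figure can be sketched as follows: each connection completes roughly one request per average response time, so the number of connections needed to sustain a target rate is the rate multiplied by that time. The function name connections_for is hypothetical, and the result is only a starting point to tune from.

```python
import math


def connections_for(rps_limit, avg_response_seconds):
    """Estimate the concurrent connections needed to reach rps_limit.

    Each connection finishes about 1 / avg_response_seconds requests per second,
    so the estimate is rps_limit * avg_response_seconds, rounded up.
    """
    return math.ceil(rps_limit * avg_response_seconds)


# The example from the text: 3 RPS with 10 s average responses.
n_conn = connections_for(3, 10)  # 30
```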

If too many requests are processed in parallel, you’ll get throttling errors. The CLI handles them automatically, but they make extraction less efficient; tune the concurrency so that throttling errors (HTTP 429) are rare.

You may also be limited by the website speed. The Zyte API tries not to hit any individual website too hard, but it could be better to limit this on the client side as well. If you’re extracting data from a single website, it could make sense to decrease the number of parallel requests; this can ensure a higher success ratio overall.

If you’re extracting data from multiple websites, it makes sense to spread the load across time: if you have websites A, B and C, don’t send requests in AAAABBBBCCCC order, send them in ABCABCABCABC order instead.

To do so, you can change the order of the queries in your input file. Alternatively, you can pass the --shuffle option; it randomly shuffles input queries before sending them to the API:

zyte-api urls.txt --shuffle --output res.jsonl
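If you would rather reorder the input file deterministically than shuffle it, the ABCABC ordering described above can be produced by grouping URLs per host and taking one URL from each group in turn. This is a sketch using only the standard library; interleave_by_host is a hypothetical helper, not part of zyte-api, and it treats the URL's network location as "the website".

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts: A B C A B C ... instead of A A B B C C."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)

    missing = object()  # sentinel padding for hosts that run out of URLs early
    ordered = []
    for batch in zip_longest(*groups.values(), fillvalue=missing):
        ordered.extend(url for url in batch if url is not missing)
    return ordered
```

When one website has far more URLs than the others, the tail of the output still consists of that website alone, so lowering the concurrency may remain useful in that case.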

Run zyte-api --help to get a description of all supported options.

asyncio API

Create an instance of the AsyncClient to use the asyncio client API. You can use the method request_raw to perform individual requests:

import asyncio
from zyte_api.aio.client import AsyncClient

client = AsyncClient()

async def single_request(url):
    return await client.request_raw({
        'url': url,
        'browserHtml': True
    })

response = asyncio.run(single_request("https://books.toscrape.com"))
# Do something with the response ..

There is also a request_parallel_as_completed method, which allows processing many URLs in parallel, using multiple connections:

import asyncio
import json
import sys

from zyte_api.aio.client import AsyncClient, create_session
from zyte_api.aio.errors import RequestError

async def extract_from(urls, n_conn):
    client = AsyncClient(n_conn=n_conn)
    requests = [
        {"url": url, "browserHtml": True}
        for url in urls
    ]
    async with create_session(n_conn) as session:
        res_iter = client.request_parallel_as_completed(requests, session=session)
        for fut in res_iter:
            try:
                res = await fut
                # do something with a result, e.g.
                print(json.dumps(res))
            except RequestError as e:
                print(e, file=sys.stderr)
                raise

urls = ["https://toscrape.com", "https://books.toscrape.com"]
asyncio.run(extract_from(urls, n_conn=15))

request_parallel_as_completed is modelled after asyncio.as_completed (see https://docs.python.org/3/library/asyncio-task.html#asyncio.as_completed), and actually uses it under the hood.
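For readers unfamiliar with asyncio.as_completed, the following standalone sketch shows the behavior being referred to: awaitables are yielded in the order they finish, not the order they were submitted. It uses only the standard library; fake_fetch is a hypothetical stand-in for a network request and makes no Zyte API calls.

```python
import asyncio


async def fake_fetch(name, delay):
    """Pretend to perform a request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def collect_in_completion_order():
    """Submit three fake requests and record the order in which they finish."""
    coros = [
        fake_fetch("slow", 0.2),
        fake_fetch("fast", 0.01),
        fake_fetch("medium", 0.1),
    ]
    finished = []
    for fut in asyncio.as_completed(coros):
        finished.append(await fut)
    return finished


# asyncio.run(collect_in_completion_order()) returns the names ordered by
# completion time rather than submission order.
```

This is why results from request_parallel_as_completed may arrive in a different order than the input queries.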

The request_parallel_as_completed and request_raw methods handle throttling (HTTP 429 errors) and network errors, retrying a request in these cases.
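To illustrate the general idea of retrying throttled requests, here is a generic retry-with-exponential-backoff sketch. It is not the library's actual retry policy: ThrottledError, retry_on_throttle, the attempt limit and the delays are all hypothetical names and values chosen for the example.

```python
import asyncio


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 (throttling) response."""


async def retry_on_throttle(make_request, max_attempts=5, base_delay=0.01):
    """Call make_request() until it succeeds or max_attempts is exhausted.

    The wait doubles after every throttled attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Because the client already retries internally, wrapping its calls in a second retry loop like this one is usually unnecessary; the sketch only shows the mechanism.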

The CLI implementation (zyte_api/__main__.py) can serve as a usage example.

API Reference

zyte_api

Python client libraries and command line utilities for Zyte API

Contributing

python-zyte-api is an open-source project. Your contribution is very welcome!

Issue Tracker

If you have a bug report, a new feature proposal, or simply would like to ask a question, please check our issue tracker on GitHub: https://github.com/zytedata/python-zyte-api/issues

Source code

Our source code is hosted on GitHub: https://github.com/zytedata/python-zyte-api

Before opening a pull request, it is worth checking current and previous issues. Some code changes may also require discussion before being accepted, so consider opening a new issue before implementing large or breaking changes.

Testing

We use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

Changes

0.4.5 (2023-01-03)

  • w3lib >= 2.1.1 is required in install_requires, to ensure that URLs are escaped properly.

  • unnecessary requests library is removed from install_requires

  • fixed tox 4 support

0.4.4 (2022-12-01)

  • Fixed an issue with submitting URLs which contain unescaped symbols

  • New “retrying” argument for AsyncClient.__init__, which allows setting a custom retrying policy for the client

  • --dont-retry-errors argument in the CLI tool

0.4.3 (2022-11-10)

  • Connections are no longer reused between requests. This reduces the number of ServerDisconnectedError exceptions.

0.4.2 (2022-10-28)

  • Bump minimum aiohttp version to 3.8.0, as earlier versions don’t support brotli decompression of responses

  • Declared Python 3.11 support

0.4.1 (2022-10-16)

  • Network errors, like server timeouts or disconnections, are now retried for up to 15 minutes, instead of 5 minutes.

0.4.0 (2022-09-20)

  • Brotli is now a required dependency. Requests are now sent with Accept-Encoding: br, and brotli responses are decompressed automatically.

0.3.0 (2022-07-29)

Internal AggStats class is cleaned up:

  • AggStats.n_extracted_queries attribute is removed, as it was a duplicate of AggStats.n_results

  • AggStats.n_results is renamed to AggStats.n_success

  • AggStats.n_input_queries is removed as redundant and misleading; AggStats got a new AggStats.n_processed property instead.

This change is backwards incompatible if you used stats directly.

0.2.1 (2022-07-29)

  • aiohttp.client_exceptions.ClientConnectorError is now treated as a network error and retried accordingly.

  • Removed the unused zyte_api.sync module.

0.2.0 (2022-07-14)

  • Temporary download errors are now retried 3 times by default. They were not retried in previous releases.

0.1.4 (2022-05-21)

This release contains usability improvements to the command-line script:

  • Instead of python -m zyte_api you can now run it as zyte-api;

  • the type of the input file (--intype argument) is guessed now, based on file extension and content; .jl, .jsonl and .txt files are supported.

0.1.3 (2022-02-03)

  • Minor documentation fix

  • Remove support for Python 3.6

  • Added support for Python 3.10

0.1.2 (2021-11-10)

  • Default timeouts changed

0.1.1 (2021-11-01)

  • CHANGES.rst updated properly

0.1.0 (2021-11-01)

  • Initial release.

License

Copyright (c) Zyte Group Ltd All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of Zyte nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.