Scraping Paginated APIs with Queues
Easily scalable with multiple queue workers. Jobs can be rate limited. Failures can be retried.
Inspired by Using Generators for Pagination, this is a look at how you could handle consuming paginated records from an API using Laravel’s queues.
(new GetInvoicesRequest($httpClient))->dispatch();
Why Queues?
- Easily scalable with multiple queue workers, which could process a ton of pages in just a few seconds.
- Jobs can be rate limited to avoid hitting 429 status codes.
- Failed jobs can be retried if they don’t complete for some reason.
- Jobs can be batched so that current progress can be determined.
- Spread the workload out among many jobs to avoid long-running processes (if running on AWS, for example).
How?
There are two ways you could go with this. Both of them would have a job that looks something like this:
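use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
class GetInvoicesJob implements ShouldQueue
{
    use Queueable;
    use Batchable;
    use Dispatchable;
    public function __construct(
        public int $page = 1
    ) {
    }
    // The request is type-hinted so the service container can inject it.
    public function handle(GetInvoicesRequest $request): void
    {
        $response = $request->page($this->page)->send();
        // do something with $response
        // like update the database
    }
}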
The handle method could have the request resolved from the service container by type-hinting it. It then builds and sends the request for its page and does some important thing with the response.
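The rate limiting and retries mentioned under "Why Queues?" could be attached to this job roughly as follows. This is a sketch rather than the article's own setup: the limiter name `invoices-api`, the 60-per-minute limit, the backoff delays, and the 30-minute retry window are all assumptions to tune against the real API. The `RateLimited` middleware releases a throttled job back onto the queue, which is why the job also uses `InteractsWithQueue` here.

```php
use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\RateLimiter;

// Typically registered in a service provider's boot() method.
// 'invoices-api' and the limit of 60 per minute are assumed values.
RateLimiter::for('invoices-api', fn (object $job) => Limit::perMinute(60));

class GetInvoicesJob implements ShouldQueue
{
    use Batchable, Dispatchable, InteractsWithQueue, Queueable;

    // Fail the job after this many unhandled exceptions (assumed value).
    public int $maxExceptions = 3;

    // constructor and handle() as shown above

    // Throttled jobs are released back onto the queue instead of running.
    public function middleware(): array
    {
        return [new RateLimited('invoices-api')];
    }

    // Seconds to wait before the first, second, and later retries.
    public function backoff(): array
    {
        return [10, 30, 60];
    }

    // Releases count as attempts, so retry by time window, not attempt count.
    public function retryUntil(): \DateTimeInterface
    {
        return now()->addMinutes(30);
    }
}
```

Because each release by the rate limiter increments the attempt count, a time-based `retryUntil()` combined with `$maxExceptions` tends to be easier to reason about than a fixed `$tries` value.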
Batch All Jobs Up Front
If you know how many total records there are (and therefore how many pages), you could create all the jobs up front.
use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Throwable;
class GetInvoicesRequest
{
    public function dispatch(): Batch
    {
        // Batch callbacks are serialized, so static closures keep $this out of them.
        return Bus::batch([
            new GetInvoicesJob(page: 1),
            new GetInvoicesJob(page: 2),
            new GetInvoicesJob(page: 3),
            // and so on...
        ])->then(static function (Batch $batch) {
            // completed successfully...
        })->catch(static function (Batch $batch, Throwable $e) {
            // failure detected...
        })->finally(static function (Batch $batch) {
            // finished executing...
        })->name(self::class)->dispatch();
    }
}
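Rather than listing the jobs by hand, the page numbers could be derived from the total count. The sketch below assumes the API reports a total record count and uses a fixed page size. `pageNumbers()` is a hypothetical helper, not part of Laravel or the original code.

```php
/**
 * Hypothetical helper: turn a total record count into a list of page numbers.
 *
 * @return int[]
 */
function pageNumbers(int $totalRecords, int $perPage): array
{
    if ($perPage < 1) {
        throw new InvalidArgumentException('perPage must be at least 1.');
    }

    if ($totalRecords < 1) {
        return [];
    }

    // Integer ceiling division, so a partial last page still gets a job.
    $lastPage = intdiv($totalRecords + $perPage - 1, $perPage);

    return range(1, $lastPage);
}

// 95 records at 20 per page need five pages (the last one is partial).
$pages = pageNumbers(totalRecords: 95, perPage: 20);

// Inside a Laravel app, each page number would become one job, for example:
// $jobs = array_map(fn (int $page) => new GetInvoicesJob(page: $page), $pages);
// Bus::batch($jobs)->name(GetInvoicesRequest::class)->dispatch();
```

The empty-array case matters because `range(1, 0)` in PHP counts downward and would produce pages 1 and 0.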
Each Job Dispatches the Next
If the total count is unknown, or if you don’t want to have the overhead of getting the count, you could make each job check if there is another page and then dispatch a job for it.
use Illuminate\Foundation\Bus\PendingDispatch;
class GetInvoicesRequest
{
    public function dispatch(): PendingDispatch
    {
        return GetInvoicesJob::dispatch(page: 1);
    }
}
Then inside the job handler you’d have something like this:
class GetInvoicesJob implements ShouldQueue
{
    use Queueable;
    use Batchable;
    use Dispatchable;
    public function __construct(
        public int $page = 1
    ) {
    }
    public function handle(GetInvoicesRequest $request)
    {
        $response = $request->page($this->page)->send();
        // do something with $response
        // like update the database
        if ($response->hasMorePages()) {
            static::dispatch(page: $this->page + 1);
        }
    }
}
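How `hasMorePages()` is implemented depends on what the API returns. As a minimal sketch, assuming the payload carries `meta.current_page` and `meta.last_page` fields (an assumption, since many APIs use cursors or a `next` link instead), a response object could look like this. `InvoicesResponse` is a hypothetical class name, and the `readonly` properties need PHP 8.1 or later.

```php
final class InvoicesResponse
{
    public function __construct(
        public readonly array $invoices,
        public readonly int $currentPage,
        public readonly int $lastPage,
    ) {
    }

    // Assumed payload shape:
    // ['data' => [...], 'meta' => ['current_page' => 1, 'last_page' => 3]]
    public static function fromArray(array $payload): self
    {
        return new self(
            invoices: $payload['data'] ?? [],
            currentPage: (int) ($payload['meta']['current_page'] ?? 1),
            lastPage: (int) ($payload['meta']['last_page'] ?? 1),
        );
    }

    // True while at least one page remains after the current one.
    public function hasMorePages(): bool
    {
        return $this->currentPage < $this->lastPage;
    }
}

$first = InvoicesResponse::fromArray([
    'data' => [['id' => 1]],
    'meta' => ['current_page' => 1, 'last_page' => 2],
]);

$last = InvoicesResponse::fromArray([
    'data' => [['id' => 2]],
    'meta' => ['current_page' => 2, 'last_page' => 2],
]);
```

Defaulting both fields to 1 when the metadata is missing means a malformed payload stops the chain instead of dispatching jobs forever.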
Conclusion
I tend to like the batching method better because the pages can be fetched in parallel, so the whole run finishes faster, and you can interrogate the batch to determine progress. But I’ve used both of these methods on real projects with success.
So, what do you think? Have you used queues for something like this?