mdapi is a service that provides an HTTP API to the various RPM repository metadata. To learn more about this service, its home page contains a how-to-use guide with some examples.
What is the problem?
Since mdapi runs on the Fedora infrastructure OpenShift instance, some of its performance issues were highlighted. The liveness and readiness probes used by OpenShift to monitor the service state started to time out when mdapi was under load (see the ticket), which in turn triggered service restarts, causing quite a few requests to fail.
The indexing performed by the packages application is responsible for the heavy load. Indexing is the process that populates the application's full-text search database, which powers its search feature. To do so, the indexer makes a request to mdapi for each of the 80,000+ packages in order to retrieve information such as a package's description or upstream URL.
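As a rough illustration of that load (not the actual packages-application code), each package ends up as one HTTP call against mdapi's /<branch>/pkg/<name> route. The public URL and the JSON field names below are assumptions made for the sketch:

# Hypothetical sketch of the kind of load the indexer generates.
import requests

MDAPI_URL = "https://mdapi.fedoraproject.org"  # assumed public instance

def fetch_package_info(branch, name):
    # One HTTP request per package to retrieve description, upstream URL, etc.
    response = requests.get(f"{MDAPI_URL}/{branch}/pkg/{name}", timeout=30)
    response.raise_for_status()
    data = response.json()
    # "description" and "url" keys are assumed field names for this sketch
    return {"description": data.get("description"), "url": data.get("url")}

# The real indexer does this for 80,000+ package names.
for name in ["guake", "kernel", "python3"]:
    print(fetch_package_info("f31", name))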
How mdapi works
mdapi is composed of two main components: an hourly cron job which downloads the latest RPM repository metadata databases (see an example), and a web service that provides an HTTP API to these metadata.
The web service uses the aiohttp web framework in order to leverage the performance of Python's asyncio (you can read pingou's original blog post about mdapi). So why were so many requests failing?
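For context, a minimal aiohttp application looks roughly like this. This is a simplified sketch, not mdapi's actual routing code; the route pattern simply mirrors the /<branch>/pkg/<name> endpoint used in the benchmarks below:

# Minimal aiohttp sketch: handlers are coroutines, so the event loop can
# serve many requests concurrently as long as nothing blocks it.
from aiohttp import web

async def get_package(request):
    branch = request.match_info["branch"]
    name = request.match_info["name"]
    # ... look up the package in the metadata database here ...
    return web.json_response({"branch": branch, "package": name})

app = web.Application()
app.add_routes([web.get("/{branch}/pkg/{name}", get_package)])

if __name__ == "__main__":
    web.run_app(app, port=8080)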
The bottleneck
When the service was written (4 years ago), asyncio in Python was fairly new and its ecosystem was quite limited. The main bottleneck in the service is the use of a synchronous API (via SQLAlchemy) to access the SQLite databases.
This means that every request trying to access the metadata in a database blocked the other requests. These blocking calls to SQLite made the server deal with each request one after the other, in a synchronous manner.
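To make the problem concrete, here is a simplified sketch of the old pattern. mdapi used SQLAlchemy, but the sketch uses the standard sqlite3 module to keep it short; the database file name and query are hypothetical. The point is the same: the synchronous query blocks the whole event loop until it returns.

# Simplified sketch of the old pattern: a synchronous SQLite query inside
# a coroutine. While execute() runs, the event loop is stuck, so every
# other pending request has to wait.
import sqlite3
from aiohttp import web

async def get_package(request):
    name = request.match_info["name"]
    conn = sqlite3.connect("primary_db.sqlite")  # hypothetical database file
    row = conn.execute(
        "SELECT description, url FROM packages WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    if row is None:
        raise web.HTTPNotFound()
    return web.json_response({"description": row[0], "url": row[1]})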
The solution
The solution was to replace SQLAlchemy with the aiosqlite library, which provides an asynchronous API to access the databases. If you are interested in the implementation details, you can consult the Pull-Request.
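As a rough sketch of the new pattern (simplified; see the Pull-Request for the real implementation), the same query becomes awaitable with aiosqlite, so the event loop is free to serve other requests while SQLite works. The file name and query are again hypothetical:

# Simplified sketch of the new pattern: the query is awaited instead of
# blocking, so other requests can be processed in the meantime.
import aiosqlite
from aiohttp import web

async def get_package(request):
    name = request.match_info["name"]
    async with aiosqlite.connect("primary_db.sqlite") as db:  # hypothetical file
        async with db.execute(
            "SELECT description, url FROM packages WHERE name = ?", (name,)
        ) as cursor:
            row = await cursor.fetchone()
    if row is None:
        raise web.HTTPNotFound()
    return web.json_response({"description": row[0], "url": row[1]})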
How to test the performance
To test the performance gain, I used the Apache ab tool, which is provided by the httpd-tools package on Fedora. This tool can perform concurrent requests, which is handy for simulating load.
Below are the two runs of the tool: the first against the master branch (with SQLAlchemy), the second against the feature branch (using aiosqlite). The performance improved from serving 28.5 requests per second to 97 requests per second with 100 concurrent requests.
[root@localhost /]# ab -c 100 -n 1000 http://127.0.0.1:8080/f31/pkg/guake
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Python/3.7
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /f31/pkg/guake
Document Length:        2676 bytes

Concurrency Level:      100
Time taken for tests:   35.073 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      2820000 bytes
HTML transferred:       2676000 bytes
Requests per second:    28.51 #/sec
Time per request:       3507.265 ms
Time per request:       35.073 [ms] (mean, across all concurrent requests)
Transfer rate:          78.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.8      1       4
Processing:    79 3420 324.7   3494    3605
Waiting:       75 3325 525.7   3486    3601
Total:         79 3421 324.7   3495    3605

Percentage of the requests served within a certain time (ms)
  50%   3495
  66%   3545
  75%   3555
  80%   3565
  90%   3581
  95%   3581
  98%   3605
  99%   3605
 100%   3605 (longest request)
[root@localhost /]# ab -c 100 -n 1000 http://127.0.0.1:8080/f31/pkg/guake
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Python/3.7
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /f31/pkg/guake
Document Length:        2676 bytes

Concurrency Level:      100
Time taken for tests:   10.305 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      2835000 bytes
HTML transferred:       2676000 bytes
Requests per second:    97.04 #/sec
Time per request:       1030.539 ms
Time per request:       10.305 [ms] (mean, across all concurrent requests)
Transfer rate:          268.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.9      0       4
Processing:    58  989 310.0    984    1906
Waiting:       53  981 308.8    979    1905
Total:         58  990 310.1    984    1910

Percentage of the requests served within a certain time (ms)
  50%    984
  66%   1084
  75%   1161
  80%   1211
  90%   1318
  95%   1448
  98%   1769
  99%   1907
 100%   1910 (longest request)