Thanks for working on this and sharing your plan.
Some imagination may be unavoidable when tracking down the cause of a bug from a log, but
extracting useful information from the log is essential for debugging.
It may be helpful if CI always enables SPDK trace and makes the trace outputs accessible when
a test fails. The trace function is very useful but may not be integrated with CI yet.
This is just an idea; I may start investigating it, but others may be better suited to do so.
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Stojaczyk, Dariusz
Sent: June 17, 2019 16:11
To: Storage Performance Development Kit
Subject: [!]Re: [SPDK] Making test output readable - Call for action
From: SPDK [mailto:firstname.lastname@example.org] On Behalf Of Stojaczyk, Dariusz
Sent: Monday, June 17, 2019 5:30 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: [SPDK] Making test output readable - Call for action
The log parser I published not long ago allows us to comfortably look
through the SPDK test output and see which tests were run or which
ones were not. It's a nice tool already, but its designated purpose
is to simplify figuring out what actually went wrong in a test run
that reported a failure. In order to achieve that goal, I'm pretty
sure we'll need some critical changes to our test scripts. That's
because a lot of the tests were never designed to be easy to debug. I
do have a few suggestions I'll be trying to push on, but I encourage
everyone to contribute as well. If you wanted to see some improvements
in SPDK autotest, now is the time to shine.
Here's a few changes I plan to introduce:
1) Send each RPC command in a separate bash command. Running rpc.py
and sending an RPC command takes over 200ms now - most of which is
taken by python initialization - so some time ago Jim introduced
the ability to send multiple rpc commands in a single rpc.py run. This
allowed us to send 5, 6, or 10 RPC commands within the same ~200ms,
but it impaired our ability to debug those unexpected RPC failures. If
the first 5 commands succeeded, but the sixth failed, what is the
output going to look like? There will be a bash xtrace for sending the
whole batch, maybe some error messages from rpc.py and then xtraces
from the ERR trap. The rpc.py output could be made somewhat human
friendly, but nevertheless it would be something unusual to the user.
I think the question would always arise - why not send each rpc
command separately, so that bash will trigger the ERR trap as soon as
possible? I see we're trying to batch RPC commands in more and more
places and I'm currently blocking each patch doing so. Instead, I
propose to optimize sending an RPC command down to ~10ms. We can
achieve that by running rpc.py as a daemon listening for commands on a
Unix domain socket. The patch for that is available on gerrithub.
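To illustrate the daemon idea, here is a toy sketch (not the actual gerrithub patch): one long-lived worker handles every command, so the per-command startup cost is paid only once. The real proposal serves JSON-RPC from rpc.py over a Unix domain socket; the FIFO, the `rpc` command name, and the file paths below are invented purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: amortize per-command startup cost with a single
# long-lived worker. A FIFO stands in for the proposed Unix domain
# socket so the sketch stays self-contained.
set -e

fifo=$(mktemp -u)
mkfifo "$fifo"
resp_file=$(mktemp)

# The "daemon": stays alive across commands instead of re-launching a
# fresh python interpreter for every rpc.py invocation.
while read -r cmd args; do
    case "$cmd" in
        rpc)  printf 'handled: %s\n' "$args" ;;
        quit) break ;;
    esac
done < "$fifo" > "$resp_file" &

# The "client": each RPC is just a write, no new process per command.
exec 3> "$fifo"
echo "rpc bdev_get_bdevs" >&3
echo "quit" >&3
exec 3>&-
wait

resp=$(cat "$resp_file")
echo "$resp"
rm -f "$fifo" "$resp_file"
```

With a real socket server the client side would be a short helper that writes the JSON-RPC request and reads the reply, which is where the ~10ms target comes from.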
2) Rely on non-zero error code wherever possible. Don't write if
statements for commands that can fail just in order to echo some error
message. That takes a few additional lines of code and in most cases
isn't going to help the user in debugging the failure. No matter how
descriptive that error message is, the person looking at the logs will
most likely need to examine the rest of the test case or even the
whole test script anyway. As we will now display bash comments in the
test output, it's a better idea to precede each non-obvious chunk of
commands with a comment describing what we're doing (and why).
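As a sketch of this guideline, the snippet below contrasts the discouraged if/echo pattern with a bare command plus a descriptive comment; the ERR trap here merely stands in for the shared one in autotest_common.sh.

```shell
#!/usr/bin/env bash
# Sketch: let the ERR trap report failures instead of wrapping every
# command in if/echo. The trap stands in for autotest_common.sh's.
set -e
trap 'echo "ERR trap: backtrace would print here" >&2' ERR

# Discouraged: extra lines, and the message rarely helps anyone
# reading the log:
#   if ! mkdir -p "$scratch_dir"; then
#       echo "failed to create scratch dir"
#       exit 1
#   fi

# Preferred: a comment describing intent, then the bare command.
# Create a scratch directory for the steps that follow.
scratch_dir=$(mktemp -d)
mkdir -p "$scratch_dir/sub"
rm -rf "$scratch_dir"
result="setup ok"
echo "$result"
```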
3) In cases where we need to check e.g. some specific fields in an RPC
response, use a plain bash test:
[ "$(jq -r '.num_blocks' <<< "$some_json_data")" = "$some_expected_value" ]
It will trigger ERR if it's not true. There's also no need to add `||
false` at the end.
4) Enforce more bash error checks. I would like to enable `set -u` to
treat variable substitution failure as an error (SFIAE? :-) ) and `set
-o pipefail` to trigger ERR on e.g. `false | true | true`. We can't
enable those just yet because a lot of our tests would start failing.
We still need to figure out some details of how we want our test
scripts to look in the end. What is certain now is that some of
our "common" scripts meant to be just sourced somewhere, perform a few
checks just upon sourcing them and can even call `exit 1` if some
condition is not met. I'm thinking specifically about
test/vhost/common.sh. This file contains functions for e.g. starting
an SPDK vhost instance or connecting a QEMU VM to it. Because of that,
it defines a variable with a path to qemu binary. The common.sh file
won't actually re-define that variable if it's set before sourcing, so
that's fine, but just as it's sourced it will check if the file at
that path actually exists and may `exit 1` if it doesn't. What if we
want to use vhost target with a userspace vhost pmd instead of QEMU?
There is a test like this in our repo but it can't be run locally if
the user doesn't have qemu installed in some arbitrary path. If we
enforce `set -u`, we won't even need to define that path to a qemu
binary at the beginning of the script, eliminating this entire problem.
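Both checks can be demonstrated in isolation. In this sketch the subshells fail deliberately, and the variable name is made up:

```shell
#!/usr/bin/env bash
# Sketch of the extra checks from point 4, run in subshells so the
# deliberate failures do not kill the calling shell.
unset UNSET_QEMU_BIN  # guarantee the variable really is unset

# set -u: expanding an unset variable becomes a hard error.
if ( set -u; : "$UNSET_QEMU_BIN" ) 2>/dev/null; then
    u_check="missed"
else
    u_check="caught"
fi

# set -o pipefail: the pipeline's status is the rightmost failing
# stage, so `false | true | true` no longer masks the error.
if ( set -o pipefail; false | true | true ); then
    pipe_check="missed"
else
    pipe_check="caught"
fi

echo "set -u: $u_check, pipefail: $pipe_check"
```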
5) Do not override the default autotest traps in particular test
scripts. The expected error handling flow is to first trigger the ERR
trap to print the backtrace. That trap is defined in
autotest_common.sh. Then trigger the EXIT trap to be defined by the
exact test script to perform some kind of cleanup inside. We currently
override the ERR trap in a couple of test scripts which results in not
having the backtrace printed sometimes.
SPDK mailing list