Skip to content

Commit b7170c3

Browse files
[JSC] detect infrastructure failure for remote stress tests
https://bugs.webkit.org/show_bug.cgi?id=222601 Reviewed by Yusuke Suzuki. run-jsc-stress-tests currently detects failures by the absence of a failure file (that is generated by each failing test). This is fragile to begin with, as it assumes that tests that fail to run (e.g. because of an error in the runner script) are successful by default. However, the main motivation for this patch is to make execution more robust when using remote hosts. Currently, --gnu-parallel-runner will transparently reschedule jobs on a different host when a remote host goes away. But detectFailures expects to be able to connect to all hosts and fetch the failure files, which fails if a remote host is still down when the run finishes. Instead, this patch changes the runners to always generate a status file with the exit code. detectFailures then fetches all status files from all hosts that are live on exit. Tests that failed to run are explicitly accounted for as 'noreport' and are set to ERROR in the final report. * Scripts/run-javascriptcore-tests: (runJSCStressTests): * Scripts/run-jsc-stress-tests: * Scripts/webkitruby/jsc-stress-test-writer-default.rb: Canonical link: https://commits.webkit.org/236372@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@275801 268f45cc-cd09-0410-ab3c-d52691b4dbfc
1 parent 9ab223c commit b7170c3

4 files changed

Lines changed: 200 additions & 59 deletions

File tree

Tools/ChangeLog

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,35 @@
1+
2021-04-10 Angelos Oikonomopoulos <angelos@igalia.com>
2+
3+
[JSC] detect infrastructure failure for remote stress tests
4+
https://bugs.webkit.org/show_bug.cgi?id=222601
5+
6+
Reviewed by Yusuke Suzuki.
7+
8+
run-jsc-stress-tests currently detects failures by the absence of
9+
a failure file (that is generated by each failing test). This is
10+
fragile to begin with, as it assumes that tests that fail to run
11+
(e.g. because of an error in the runner script) are successful by
12+
default.
13+
14+
However, the main motivation for this patch is to make execution
15+
more robust when using remote hosts. Currently,
16+
--gnu-parallel-runner will transparently reschedule jobs on a
17+
different host when a remote host goes away. But detectFailures
18+
expects to be able to connect to all hosts and fetch the failure
19+
files, which fails if a remote host is still down when the run
20+
finishes.
21+
22+
Instead, this patch changes the runners to always generate a status
23+
file with the exit code. detectFailures then fetches all status
24+
files from all hosts that are live on exit. Tests that failed to
25+
run are explicitly accounted for as 'noreport' and are set to
26+
ERROR in the final report.
27+
28+
* Scripts/run-javascriptcore-tests:
29+
(runJSCStressTests):
30+
* Scripts/run-jsc-stress-tests:
31+
* Scripts/webkitruby/jsc-stress-test-writer-default.rb:
32+
133
2021-04-10 Aakash Jain <aakash_jain@apple.com>
234

335
Improve step description when compile-webkit step is skipped

Tools/Scripts/run-javascriptcore-tests

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -932,9 +932,20 @@ sub runJSCStressTests
932932
}
933933
print "\n";
934934

935+
my @jscStressNoResultList = readAllLines($jscStressResultsDir . "/noresult");
936+
my $numJSCStressNoResultTests = @jscStressNoResultList;
937+
938+
if ($numJSCStressNoResultTests) {
939+
$isTestFailed = 1;
940+
}
941+
foreach my $testNoResult (@jscStressNoResultList) {
942+
$reportData{$testNoResult} = {actual => "ERROR"};
943+
}
944+
935945
print "Results for JSC stress tests:\n";
936946
printThingsFound($numJSCStressFailures, "failure", "failures", "found");
937-
print " OK.\n" if $numJSCStressFailures == 0;
947+
printThingsFound($numJSCStressNoResultTests, "test", "tests", "failed to complete");
948+
print " OK.\n" if $numJSCStressFailures == 0 and $numJSCStressNoResultTests == 0;
938949

939950
print "\n";
940951

Tools/Scripts/run-jsc-stress-tests

Lines changed: 95 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,9 @@ raise unless SCRIPTS_PATH.basename.to_s == "Scripts"
5757
raise unless SCRIPTS_PATH.dirname.basename.to_s == "Tools"
5858

5959
HELPERS_PATH = SCRIPTS_PATH + "jsc-stress-test-helpers"
60+
STATUS_FILE_PREFIX = "test_status_"
61+
STATUS_FILE_PASS = "P"
62+
STATUS_FILE_FAIL = "F"
6063

6164
begin
6265
require 'shellwords'
@@ -134,6 +137,7 @@ $forceCollectContinuously = false
134137
$reportExecutionTime = false
135138
$ldd = nil
136139
$artifact_exec_wrapper = nil
140+
$runUniqueId = Random.new.bytes(16).unpack("H*")[0]
137141

138142
def usage
139143
puts "run-jsc-stress-tests -j <shell path> <collections path> [<collections path> ...]"
@@ -535,9 +539,6 @@ if $testWriter
535539
end
536540
end
537541

538-
$numFailures = 0
539-
$numPasses = 0
540-
541542
# We force all tests to use a smaller (1.5M) stack so that stack overflow tests can run faster.
542543
BASE_OPTIONS = ["--useFTLJIT=false", "--useFunctionDotArguments=true", "--validateExceptionChecks=true", "--useDollarVM=true", "--maxPerThreadStackUsage=1572864"]
543544
EAGER_OPTIONS = ["--thresholdForJITAfterWarmUp=10", "--thresholdForJITSoon=10", "--thresholdForOptimizeAfterWarmUp=20", "--thresholdForOptimizeAfterLongWarmUp=20", "--thresholdForOptimizeSoon=20", "--thresholdForFTLOptimizeAfterWarmUp=20", "--thresholdForFTLOptimizeSoon=20", "--thresholdForOMGOptimizeAfterWarmUp=20", "--thresholdForOMGOptimizeSoon=20", "--maximumEvalCacheableSourceLength=150000", "--useEagerCodeBlockJettisonTiming=true", "--repatchBufferingCountdown=0"]
@@ -1821,15 +1822,20 @@ def appendFailure(plan)
18211822
| outp |
18221823
outp.puts plan.name
18231824
}
1824-
$numFailures += 1
18251825
end
18261826

18271827
def appendPass(plan)
18281828
File.open($outputDir + "passed", "a") {
18291829
| outp |
18301830
outp.puts plan.name
18311831
}
1832-
$numPasses += 1
1832+
end
1833+
1834+
def appendNoResult(plan)
1835+
File.open($outputDir + "noresult", "a") {
1836+
| outp |
1837+
outp.puts plan.name
1838+
}
18331839
end
18341840

18351841
def appendResult(plan, didPass)
@@ -2037,7 +2043,7 @@ def cleanRunnerDirectory
20372043
}
20382044
end
20392045

2040-
def sshRead(cmd, remoteIndex=0)
2046+
def sshRead(cmd, remoteIndex=0, options={})
20412047
raise unless $remote
20422048

20432049
remoteHost = $remoteHosts[remoteIndex]
@@ -2050,7 +2056,7 @@ def sshRead(cmd, remoteIndex=0)
20502056
result += line
20512057
}
20522058
}
2053-
raise "#{$?}" unless $?.success?
2059+
raise "#{$?}" unless $?.success? or options[:ignoreFailure]
20542060
result
20552061
end
20562062

@@ -2205,52 +2211,110 @@ def runTestRunner(remoteIndex=0)
22052211
end
22062212
end
22072213

2208-
def detectFailures
2209-
raise if $bundle
2210-
failures = []
2214+
def getStatusMap
2215+
name_re = /^[.]\/#{STATUS_FILE_PREFIX}(\d+)$/
2216+
map = {}
22112217
if $remote
22122218
$remoteHosts.each_with_index {
22132219
| host, remoteIndex |
2214-
output = sshRead("cd #{host.remoteDirectory}/#{$outputDir.basename}/.runner && find . -maxdepth 1 -name \"test_fail_*\"", remoteIndex)
2220+
output = sshRead("cd #{host.remoteDirectory}/#{$outputDir.basename}/.runner && find . -maxdepth 1 -name \"#{STATUS_FILE_PREFIX}*\" -exec sh -c \"printf \\\"%s \\\" {}; cat {}\" \\;", remoteIndex, :ignoreFailure => true)
22152221
output.split(/\n/).each {
22162222
| line |
2217-
next unless line =~ /test_fail_/
2218-
failures << $~.post_match.to_i
2223+
name, run_id, _, result = line.split(' ')
2224+
md = name_re.match(name)
2225+
if md.nil?
2226+
$stderr.puts("Could not parse name in `#{line}`")
2227+
exit(1)
2228+
end
2229+
if run_id != $runUniqueId
2230+
# This may conceivably happen if a remote goes
2231+
# away in the middle of a run and comes back
2232+
# online in the middle of a different run.
2233+
$stderr.puts("Ignoring stale status file for #{name} (ID #{run_id} but current ID is #{$runUniqueId})")
2234+
next
2235+
end
2236+
index = md[1].to_i
2237+
if map.has_key?(index)
2238+
$stderr.puts("Duplicate state file for #{index}")
2239+
# One scenario in which this could happen:
2240+
# Test T runs on remote host A and
2241+
# 1. the status file reaches A's disk
2242+
# 2. somehow the gnu parallel runner is not made aware of the test's completion (packet loss?)
2243+
# 3. A machine crashes
2244+
# 4. gnu parallel re-schedules the test to run on remote host B, where it runs to completion
2245+
# 5. B comes back online before the end of the run
2246+
# 6. we collect the status files from all remotes and end up with two status files for T.
2247+
prev = map[index]
2248+
# map[index] holds
2249+
# - a number, if all results codes we've observed for a test are the same
2250+
# - an array, if they diverge.
2251+
if prev.is_a?(Array)
2252+
prev.push(result)
2253+
elsif prev != result
2254+
# If the two results differ, keep them
2255+
# both. This is simply a way to make note of
2256+
# the divergence (for later reporting).
2257+
map[index] = [prev, result]
2258+
else
2259+
# Got the same result, no need to do anything.
2260+
end
2261+
else
2262+
map[index] = result
2263+
end
22192264
}
22202265
}
22212266
else
22222267
Dir.foreach($runnerDir) {
22232268
| filename |
2224-
next unless filename =~ /test_fail_/
2225-
failures << $~.post_match.to_i
2269+
md = name_re.match("./#{filename}")
2270+
next unless md
2271+
File.open("#{$runnerDir}/#{filename}", "r") { |f|
2272+
runId, _, result = f.read.chomp.split(' ')
2273+
if runId != $runUniqueId
2274+
# We clean the dir before a starting a run.
2275+
raise "Can't happen"
2276+
end
2277+
map[md[1].to_i] = result
2278+
}
22262279
}
22272280
end
2281+
map
2282+
end
22282283

2229-
failureSet = {}
2230-
2231-
failures.each {
2232-
| failure |
2233-
appendFailure($runlist[failure])
2234-
failureSet[failure] = true
2235-
}
2236-
2284+
def detectFailures
2285+
raise if $bundle
2286+
noresult = 0
2287+
statusMap = getStatusMap
22372288
familyMap = {}
2289+
22382290
$runlist.each_with_index {
22392291
| plan, index |
22402292
unless familyMap[plan.family]
22412293
familyMap[plan.family] = []
22422294
end
2243-
if failureSet[index]
2244-
appendResult(plan, false)
2245-
familyMap[plan.family] << {:result => "FAIL", :plan => plan};
2295+
if not statusMap.has_key?(index) or statusMap[index].is_a?(Array)
2296+
appendNoResult(plan)
2297+
noresult += 1
22462298
next
2299+
end
2300+
result = nil
2301+
if statusMap[index] == STATUS_FILE_PASS
2302+
appendPass(plan)
2303+
result = "PASS"
22472304
else
2248-
appendResult(plan, true)
2249-
familyMap[plan.family] << {:result => "PASS", :plan => plan};
2305+
appendFailure(plan)
2306+
result = "FAIL"
22502307
end
2251-
appendPass(plan)
2308+
appendResult(plan, statusMap[index] == STATUS_FILE_PASS)
2309+
familyMap[plan.family] << {:result => result, :plan => plan }
22522310
}
22532311

2312+
if noresult > 0
2313+
$stderr.puts("Could not get the exit status for #{noresult} tests")
2314+
# We can't change our exit code, as run-javascriptcore-tests
2315+
# expects 0 even when there are failures.
2316+
end
2317+
22542318
File.open($outputDir + "resultsByFamily", "w") {
22552319
| outp |
22562320
first = true
@@ -2261,7 +2325,7 @@ def detectFailures
22612325
else
22622326
outp.puts
22632327
end
2264-
2328+
22652329
outp.print "#{familyName}:"
22662330

22672331
numPassed = 0
@@ -2299,6 +2363,7 @@ end
22992363

23002364
clean($outputDir + "failed")
23012365
clean($outputDir + "passed")
2366+
clean($outputDir + "noresult")
23022367
clean($outputDir + "results")
23032368
clean($outputDir + "resultsByFamily")
23042369
clean($outputDir + ".vm")

0 commit comments

Comments
 (0)