Description
Move fail-on-suspend logic from URLRequestJob to TcpClientSocket.
The old code had a couple of issues:
1) It failed requests on entry to suspend mode, regardless of whether a
read was actually pending. This could confuse state machines that were
asynchronously processing partial results. Fixing this at the URLRequest
layer is made difficult due to the URLRequestJob's complicated state
machine. TcpClientSocket has a much simpler state machine.
2) It failed requests with ERR_ABORTED, which normally means the
URLRequest was cancelled by something above the network stack, so it
is not accurate in this instance. The RenderView also has magic
handling for this error code; it's unclear whether that logic is
desirable in this case.
If it is, the network stack should not be responsible for ensuring that
logic is triggered.
3) This is more minor, but it also resulted in failing requests that
aren't going over the network.
4) It only applied to URLRequests, and not other TCP sockets that may
be hung as a result of entering/exiting suspend mode.
This CL fixes 1-3, with 4 left as a potentially trivial follow-up.
In this CL the new logic is only enabled for requests that go through
a socket pool and for FTP sockets, with the HostResolver also updated
to fail requests on suspend. This should mostly catch the cases the
old code caught, with a fair number of exceptions (doesn't cover
QUIC, does cover WebSockets after a connection is established, does
cover a couple of classes that don't talk HTTP but still use socket
pools to get proxy support, fails HTTP/2 at the session layer, and
doesn't try to drain HTTP sockets of data for reuse on network
changes).
BUG=651120
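To make the description concrete, here is a minimal sketch of the direction described above: a TCP socket that registers as a power observer and fails only I/O that is actually pending with ERR_NETWORK_IO_SUSPENDED. Class and member names are illustrative and the PowerMonitor wiring is an assumption about the shape of the change, not the actual patch.

#include "base/callback_helpers.h"
#include "base/power_monitor/power_monitor.h"
#include "base/power_monitor/power_observer.h"
#include "net/base/completion_callback.h"
#include "net/base/io_buffer.h"
#include "net/base/net_errors.h"

// Illustrative only: a simplified socket that observes suspend and fails a
// pending read, roughly mirroring the approach in the description.
class SuspendAwareSocket : public base::PowerObserver {
 public:
  SuspendAwareSocket() { base::PowerMonitor::Get()->AddObserver(this); }
  ~SuspendAwareSocket() { base::PowerMonitor::Get()->RemoveObserver(this); }

  int Read(net::IOBuffer* buf,
           int buf_len,
           const net::CompletionCallback& callback) {
    if (suspended_)
      return net::ERR_NETWORK_IO_SUSPENDED;  // New reads fail immediately.
    // Real I/O elided; pretend the read is asynchronously pending.
    read_callback_ = callback;
    return net::ERR_IO_PENDING;
  }

  // base::PowerObserver implementation:
  void OnSuspend() override {
    suspended_ = true;
    // Only a read that is actually pending is failed, so consumers that are
    // busy processing an already-completed read are left alone.
    if (!read_callback_.is_null())
      base::ResetAndReturn(&read_callback_).Run(net::ERR_NETWORK_IO_SUSPENDED);
  }
  void OnResume() override { suspended_ = false; }

 private:
  bool suspended_ = false;
  net::CompletionCallback read_callback_;
};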
Patch Set 1
Patch Set 2: Fix NaCl
Patch Set 3: Really fix NaCl???
Patch Set 4: Fix callbacks
Patch Set 5: Skip test
Patch Set 6: Fix connect error case
Total comments: 2
Patch Set 7: Merge
Patch Set 8: Not rightly sure what this is...
Patch Set 9: Fix
Patch Set 10: .
Messages
Total messages: 57 (38 generated)
mmenke@chromium.org changed reviewers: + bnc@google.com, eroman@chromium.org, rch@chromium.org, rdsmith@chromium.org
Hey guys, I don't actually want reviews here (I need to add a bunch of tests, and do some manual testing before this is ready for review, and don't need 4 reviewers), but I wanted to run this idea in front of some more eyes to see if there are reasons why doing this is a really bad idea.

The summary: Currently, when entering suspend mode, we fail all URLRequestJobs with ERR_ABORTED. This is because when we resume, such sockets will often not return errors, and effectively be blackholed. How we do this has issues mentioned in the CL description. Moving the failure to the socket layer fixes these, and also gives us the option of applying this logic more widely, since it seems like this is more a TCP issue than a URLRequest issue.

This code also removes the behavior from QUIC itself (Which may be able to properly handle this case...Hopefully?).

Thoughts / comments / vociferous objections to this are all welcome.
On 2016/10/26 20:01:54, mmenke wrote:
> This code also removes the behavior from QUIC itself (Which may be able to
> properly handle this case...Hopefully?).

I'll think more about the non-QUIC ramifications, but in the meantime one thought about QUIC. The QUIC connection idle timeout is 30s (though we have a 5min experiment running). If the suspend period is longer than that then the server will already have closed the connection. Do you have any idea how long periods of suspend are typically?
On 2016/10/26 22:00:07, Ryan Hamilton wrote: > On 2016/10/26 20:01:54, mmenke wrote: > > This code also removes the behavior from QUIC itself (Which may be able to > > properly handle this case...Hopefully?). > > I'll think more about about the non-QUIC ramifications, but in the meantime one > thought about QUIC. The QUIC connection idle timeout is 30s (though we have a > 5min experiment running). If the suspend period is longer than that then the > server will already have closed the connection. Do you have any idea how long > periods of suspend are typically? I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds is certainly possible, if someone realizes they wanted to do something on their PC right after closing the lid.
On 2016/10/26 22:05:38, mmenke wrote: > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > On 2016/10/26 20:01:54, mmenke wrote: > > > This code also removes the behavior from QUIC itself (Which may be able to > > > properly handle this case...Hopefully?). > > > > I'll think more about about the non-QUIC ramifications, but in the meantime > one > > thought about QUIC. The QUIC connection idle timeout is 30s (though we have a > > 5min experiment running). If the suspend period is longer than that then the > > server will already have closed the connection. Do you have any idea how long > > periods of suspend are typically? > > I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds is > certainly possible, if someone realizes they wanted to do something on their PC > right after closing the lid. if that's the case, then there's probably not much value in keeping QUIC sessions alive. :/
On 2016/10/26 22:09:35, Ryan Hamilton wrote: > On 2016/10/26 22:05:38, mmenke wrote: > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > This code also removes the behavior from QUIC itself (Which may be able to > > > > properly handle this case...Hopefully?). > > > > > > I'll think more about about the non-QUIC ramifications, but in the meantime > > one > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we have > a > > > 5min experiment running). If the suspend period is longer than that then the > > > server will already have closed the connection. Do you have any idea how > long > > > periods of suspend are typically? > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds > is > > certainly possible, if someone realizes they wanted to do something on their > PC > > right after closing the lid. > > if that's the case, then there's probably not much value in keeping QUIC > sessions alive. :/ I'm happy to hook up QUIC sessions to the signal, too, if they need it (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it uses wouldn't trigger on resume).
On 2016/10/26 22:12:13, mmenke wrote: > On 2016/10/26 22:09:35, Ryan Hamilton wrote: > > On 2016/10/26 22:05:38, mmenke wrote: > > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > > This code also removes the behavior from QUIC itself (Which may be able > to > > > > > properly handle this case...Hopefully?). > > > > > > > > I'll think more about about the non-QUIC ramifications, but in the > meantime > > > one > > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we > have > > a > > > > 5min experiment running). If the suspend period is longer than that then > the > > > > server will already have closed the connection. Do you have any idea how > > long > > > > periods of suspend are typically? > > > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds > > is > > > certainly possible, if someone realizes they wanted to do something on their > > PC > > > right after closing the lid. > > > > if that's the case, then there's probably not much value in keeping QUIC > > sessions alive. :/ > > I'm happy to hook up QUIC sessions to the signal, too, if they need it > (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it uses > wouldn't trigger on resume). Really?! Ugh. Quic's idle timeout detection (and all of QUIC's internal bookkeeping) is based on base::TimeTicks::Now(). If the computer goes to sleep, does that mean that TimeTicks don't move forward during that period? If so, that sounds excitingly problematic. Maybe I'm over thinking this.... let me try to think what ramifications this might have.
On 2016/10/26 22:16:09, Ryan Hamilton wrote: > On 2016/10/26 22:12:13, mmenke wrote: > > On 2016/10/26 22:09:35, Ryan Hamilton wrote: > > > On 2016/10/26 22:05:38, mmenke wrote: > > > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > > > This code also removes the behavior from QUIC itself (Which may be > able > > to > > > > > > properly handle this case...Hopefully?). > > > > > > > > > > I'll think more about about the non-QUIC ramifications, but in the > > meantime > > > > one > > > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we > > have > > > a > > > > > 5min experiment running). If the suspend period is longer than that then > > the > > > > > server will already have closed the connection. Do you have any idea how > > > long > > > > > periods of suspend are typically? > > > > > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 > seconds > > > is > > > > certainly possible, if someone realizes they wanted to do something on > their > > > PC > > > > right after closing the lid. > > > > > > if that's the case, then there's probably not much value in keeping QUIC > > > sessions alive. :/ > > > > I'm happy to hook up QUIC sessions to the signal, too, if they need it > > (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it > uses > > wouldn't trigger on resume). > > Really?! Ugh. Quic's idle timeout detection (and all of QUIC's internal > bookkeeping) is based on base::TimeTicks::Now(). If the computer goes to sleep, > does that mean that TimeTicks don't move forward during that period? If so, that > sounds excitingly problematic. Maybe I'm over thinking this.... let me try to > think what ramifications this might have. Hrm...Looks like I may be wrong about that.
On 2016/10/26 22:18:04, mmenke wrote: > On 2016/10/26 22:16:09, Ryan Hamilton wrote: > > On 2016/10/26 22:12:13, mmenke wrote: > > > On 2016/10/26 22:09:35, Ryan Hamilton wrote: > > > > On 2016/10/26 22:05:38, mmenke wrote: > > > > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > > > > This code also removes the behavior from QUIC itself (Which may be > > able > > > to > > > > > > > properly handle this case...Hopefully?). > > > > > > > > > > > > I'll think more about about the non-QUIC ramifications, but in the > > > meantime > > > > > one > > > > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we > > > have > > > > a > > > > > > 5min experiment running). If the suspend period is longer than that > then > > > the > > > > > > server will already have closed the connection. Do you have any idea > how > > > > long > > > > > > periods of suspend are typically? > > > > > > > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 > > seconds > > > > is > > > > > certainly possible, if someone realizes they wanted to do something on > > their > > > > PC > > > > > right after closing the lid. > > > > > > > > if that's the case, then there's probably not much value in keeping QUIC > > > > sessions alive. :/ > > > > > > I'm happy to hook up QUIC sessions to the signal, too, if they need it > > > (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it > > uses > > > wouldn't trigger on resume). > > > > Really?! Ugh. Quic's idle timeout detection (and all of QUIC's internal > > bookkeeping) is based on base::TimeTicks::Now(). If the computer goes to > sleep, > > does that mean that TimeTicks don't move forward during that period? If so, > that > > sounds excitingly problematic. Maybe I'm over thinking this.... let me try to > > think what ramifications this might have. > > Hrm...Looks like I may be wrong about that. Actually, I'm not sure. Worth experimenting with, if you're strongly depending on one behavior there.
bnc@chromium.org changed reviewers: + bnc@chromium.org
https://codereview.chromium.org/2447083003/diff/100001/net/socket/stream_sock...
File net/socket/stream_socket.h (right):

https://codereview.chromium.org/2447083003/diff/100001/net/socket/stream_sock...
net/socket/stream_socket.h:118: // connected to remote systems. See https://crbug.com/4606).
Unmatched ')'.
So it won't surprise you to hear that I'm in favor of this CL; removing asynchronous, uncontrolled-by-the-code events in favor of funneling errors through well-tested error handling pathways SGTM.

I wouldn't expect functionality problems from the addition of putting failure in at the socket layer, since all clients need to be able to handle sockets being closed on them asynchronously anyway. So to me that leaves the following possible sources of problems:

* Things working that didn't before. I'm thinking specifically of URLRequests that are being serviced from cache surviving fine across a suspend. I don't see any reason why this would cause problems, but I raise it as an expected change in behavior.

* Things that are part of the URLRequestJob pathway that aren't part of the TCP socket failure pathway. QUIC's obvious here, and you and Ryan are figuring that out. And you've named several other possibilities. I'm not sure I'm going to be of help here, as this is about knowing corner cases of the code, and none come to me that you haven't mentioned.

* Things that are relying on the push notification of failure for good behavior, and will not behave as well on pull. The only thing that comes to mind here is "hanging gets". How does this change deal with outstanding async reads?

Two other thoughts:

* As I say above, I wouldn't expect problems from the socket change, only from the URLRequestJob change. While I'm sure I'm wrong :-}, that suggests the possibility of making the socket change global and seeing how hard it is to flush out the problems it produces. Do you expect problems from just the socket change?

* The above thought makes me wonder about splitting this CL in two, one for the socket change and one for the URLRequestJob change. This would require making sure that both of them being sensitive to suspends didn't have any bad synergistic effects.

FWIW.
On 2016/10/27 15:51:11, Randy Smith - Not in Mondays wrote: > So it won't surprise you to hear that I'm in favor of this CL; removing > asynchronous, uncontrolled by the code events in favor of funneling errors > through well-tested error handling pathways SGTM. > > I wouldn't expect functionality problems from the addition of putting failure in > at the socket layer, since all clients need to be able to handle sockets being > closed on them asynchronously anyway. So to me that leaves the following > possible sources of problems: > > * Things working that didn't before. I'm thinking specifically of URLRequests > that are being serviced from cache surviving fine across a suspend. I don't see > any reason why this would cause problems, but I raise it as an expected change > in behavior. > > * Things that are part of the URLRequestJob pathway that aren't part of the TCP > socket failure pathway. QUIC's obvious here, and you and Ryan are figuring that > out. And you've named several other possibilities. I'm not sure I'm going to > be of help here, as this is about knowing corner cases of the code, and none > come to me that you haven't mentioned. > > * Things that are relying on the push notification of failure for good behavior, > and will not behave as well on pull. The only thing that comes to mind here is > "hanging gets". How does this change deal with outstanding async reads? > > Two other thoughts: > * As I say above, I wouldn't expect problems from the socket change, only from > the URLRequestJob change. While I'm sure I'm wrong :-}, that suggests that > possibility of making the socket change global and seeing how hard it is to > flush out the problems it produces. Do you expect problems from just the socket > change? > > * The above thought makes me wondering about splitting this CL in two, one for > the socket change and one for the URLRequestJob change. This would require > making sure that both of them being sensitive to suspends didn't have any bad > synergistic effects. > > FWIW. Oh, I'll also add: Is there any way (probably in a different CL) to postpone the error until the kernel buffer has been drained of any data that arrived before the suspend? Maybe a read with O_NONBLOCK?
On 2016/10/27 16:04:23, Randy Smith - Not in Mondays wrote:
> On 2016/10/27 15:51:11, Randy Smith - Not in Mondays wrote:
> > So it won't surprise you to hear that I'm in favor of this CL; removing
> > asynchronous, uncontrolled by the code events in favor of funneling errors
> > through well-tested error handling pathways SGTM.
> >
> > I wouldn't expect functionality problems from the addition of putting failure
> > in at the socket layer, since all clients need to be able to handle sockets
> > being closed on them asynchronously anyway. So to me that leaves the following
> > possible sources of problems:
> >
> > * Things working that didn't before. I'm thinking specifically of URLRequests
> > that are being serviced from cache surviving fine across a suspend. I don't
> > see any reason why this would cause problems, but I raise it as an expected
> > change in behavior.
> >
> > * Things that are part of the URLRequestJob pathway that aren't part of the
> > TCP socket failure pathway. QUIC's obvious here, and you and Ryan are figuring
> > that out. And you've named several other possibilities. I'm not sure I'm going
> > to be of help here, as this is about knowing corner cases of the code, and
> > none come to me that you haven't mentioned.
> >
> > * Things that are relying on the push notification of failure for good
> > behavior, and will not behave as well on pull. The only thing that comes to
> > mind here is "hanging gets". How does this change deal with outstanding async
> > reads?

If there are outstanding reads/writes/connects, this will fail them (If the socket is destroyed/disconnected after it fails one, before it fails the other, it will just fail one of them). It will also fail all new reads/writes (But not connects). Once the socket is disconnected, or connected again without disconnecting, the socket will work as normal (Not sure if connecting again without disconnecting is expected to work - the API isn't clear on that).

> > Two other thoughts:
> > * As I say above, I wouldn't expect problems from the socket change, only from
> > the URLRequestJob change. While I'm sure I'm wrong :-}, that suggests that
> > possibility of making the socket change global and seeing how hard it is to
> > flush out the problems it produces. Do you expect problems from just the
> > socket change?

My main concern with the socket change is with connections to localhost. I'd like to apply it just to some sockets at first, see how things go, and then if that goes well apply it to all sockets a few releases later, possibly with a localhost exemption.

> > * The above thought makes me wondering about splitting this CL in two, one for
> > the socket change and one for the URLRequestJob change. This would require
> > making sure that both of them being sensitive to suspends didn't have any bad
> > synergistic effects.

This makes sense...I don't think all synergistic effects can be avoided (Just consider requests that, on failure, trigger other requests, that reach the DNS layer, and vice versa, though that kind of issue at the DNS layer seems unavoidable, even if we do it all at once). I think, at worst, we'd be failing extra requests that haven't yet actually reached the network layer when going into suspend mode, which I don't think is a huge deal.

> > FWIW.
>
> Oh, I'll also add: Is there any way (probably in a different CL) to postpone the
> error until the kernel buffer has been drained of any data that arrived before
> the suspend? Maybe a read with O_NONBLOCK?
Hrm...I could try delaying the error until an operation doesn't complete synchronously, but that sounds unlikely to work well on full duplex sockets. Suppose we can read data, but there's only a write pending. Do we wait for the consumer to read again? What if they don't? An arbitrary timeout?

I don't think I'm going to have something ready to land in the next 3 weeks (before branch), so in the meantime I'm thinking of maybe switching the URLRequestJob logic to return ERR_NETWORK_IO_SUSPENDED on entering suspend mode, to make sure at least that change is safe, and so we effectively land fewer changes at once.
And thanks for all the feedback, Randy!
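A test-shaped sketch of the contract described two messages up: a pending read fails on suspend, new reads fail while suspended, and the socket behaves normally once reconnected. TestPowerSource and ConnectToLocalTestServer are placeholder helpers for this sketch, not the tests in this patch.

// Illustrative test sketch; TestPowerSource stands in for however the test
// injects suspend/resume notifications, and socket setup is elided.
TEST(TcpClientSocketTest, PendingReadFailsOnSuspend) {
  TestPowerSource power_source;  // Hypothetical helper.

  // Connect a TCPClientSocket to a local test server (hypothetical helper).
  std::unique_ptr<TCPClientSocket> socket = ConnectToLocalTestServer();

  scoped_refptr<IOBuffer> buf(new IOBuffer(1024));
  TestCompletionCallback read_callback;
  ASSERT_EQ(ERR_IO_PENDING,
            socket->Read(buf.get(), 1024, read_callback.callback()));

  // Entering suspend mode fails the read that is actually pending...
  power_source.GenerateSuspendEvent();
  EXPECT_EQ(ERR_NETWORK_IO_SUSPENDED, read_callback.WaitForResult());

  // ...and new reads fail immediately until the socket is reconnected.
  EXPECT_EQ(ERR_NETWORK_IO_SUSPENDED,
            socket->Read(buf.get(), 1024, read_callback.callback()));
}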
> If the computer goes to sleep, does that mean that TimeTicks don't move forward during that period?

Unfortunately it depends on the platform -- we don't have a consistent policy in implementations of base::TimeTicks. You shouldn't assume this, but in practice:

* On Windows, TimeTicks continues ticking while the system is asleep (when using QueryPerformanceCounter).
* On Linux and Mac, TimeTicks stops ticking while the system is asleep (by virtue of using CLOCK_MONOTONIC on Linux, or mach_absolute_time() on Mac).

Linux has had support for CLOCK_BOOTTIME for a while, but we don't use it.

For more information on TimeTicks see https://goo.gl/uNvCYu
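As an aside, one quick way to check this empirically on a particular machine is to compare the wall-clock delta against the TimeTicks delta across a suspend. This is just a sketch using base::Time and base::TimeTicks; the helper names are made up.

#include "base/time/time.h"

// Snapshot both clocks before suspending the machine and again after resume,
// then compare the deltas.
struct ClockSnapshot {
  base::Time wall;
  base::TimeTicks ticks;
};

ClockSnapshot TakeSnapshot() {
  return {base::Time::Now(), base::TimeTicks::Now()};
}

// Returns how much wall-clock time was *not* counted by TimeTicks between the
// two snapshots. Roughly zero means TimeTicks kept ticking through the sleep
// (Windows/QPC behavior); roughly the sleep duration means the tick clock
// paused (CLOCK_MONOTONIC on Linux, mach_absolute_time() on Mac).
base::TimeDelta WallTimeNotCountedByTicks(const ClockSnapshot& before,
                                          const ClockSnapshot& after) {
  return (after.wall - before.wall) - (after.ticks - before.ticks);
}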
Thanks Matt, I am very supportive of moving this logic down to the socket layer!

A few high level comments before going into CL specifics:

(1) I am skeptical of the value of making HostResolverImpl a power observer. (Discounting the async resolver in the justification below):

* We can't properly cancel outstanding requests thanks to getaddrinfo() being blocking on a worker thread. So at best we are abandoning the attempt threads. As you point out, there may be a flurry of activity in cancelling and possibly re-starting along these change edges, so we may not even properly accomplish that.

* We already have a baked in mechanism to re-try getaddrinfo() when things are taking a long time.

Given this, I would expect doing nothing is better. If the getaddrinfo() runs to successful completion (yay). If it fails quickly after resume, that is fine too (no worse than having aborted and failed). If it gets wedged and takes a while to timeout after resume, then the HostResolverImpl should kick off another attempt thread and win the race.

(2) Should we even be doing this by default?

We have a long history behind this unsavory hack, which goes back to working around a blue screen of death that was happening on Windows (long since fixed).

Fundamentally it feels like brokenness in the host network implementation that we are working around, which is why I wholeheartedly agree with putting it into a layer close to that in our code, and trying to better contain the ugliness.

I would love if we can better justify the concrete instances where we benefit from this. There are definite downsides to doing this, which leads to a frustrating user experience (same thing goes for ERR_NETWORK_CHANGED), and I am not sure that we are making the right tradeoff.

Ideally we would only enable this for OS versions/configurations where we know it to be a problem. If it truly is a real issue, then we may want to solve this with a more general mechanism of identifying wedged sockets (maybe an aggressive timeout on reads immediately following a power-mode resumption, but not actively aborting stuff).

Other applications (including IIRC Firefox) don't do these sorts of hacks, and I would like to better understand what benefits we get from them.

Thoughts?

https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
File net/dns/host_resolver_impl.cc (right):

https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
net/dns/host_resolver_impl.cc:2472: AbortAllInProgressJobs();
This is going to fail the requests with ERR_NETWORK_CHANGED, and *not* ERR_NETWORK_IO_SUSPENDED. Is that intentional?
On 2016/10/27 19:26:26, eroman wrote: > Thanks Matt, I am very supportive of moving this logic down to the socket layer! > > A few high level comments before going into CL specifics: > > (1) I am skeptical of the value of making HostResolverImpl a power observer. > (Discounting the async resolver in the justification below): > > * We can't properly cancel outstanding requests thanks to getaddrinfo() being > blocking on a worker thread. So at best we are abandoning the attempt threads. > As you point out, there may be a flurry of activity in cancelling and possibly > re-starting along these change edges, so we may not even properly accomplish > that. > > * We already have a baked in mechanism to re-try getaddrinfo() when things are > taking a long time. > > Given this, I would expect doing nothing is better. If the getaddrinfo() runs to > successful completion (yay). If it fails quickly after resume, that is fine too > (no worse than having aborted and failed). If it gets wedged and takes a while > to timeout after resume, then the hostresolverimpl should kick off another > attempt thread and win the race. > > (2) Should we even be doing this by default? > > We have a long history behind this unsavory hack, which goes back to working > around a blue screen of death that was happening on windows (long since fixed). > > Fundamentally it feels like brokenness in the host network implementation that > we are working around, which is why I wholeheartedly agree with putting it into > a layer close to that in our code, and trying to better contain the ugliness. > > I would love if we can better justify the concrete instances where we benefit > from this. There are definite downsides to doing this, which leads to a > frustrating user experience (same thing goes for ERR_NETWORK_CHANGED), which I > am not sure that we are making the right tradeoff on. > > Ideally we would only enable this for OS versions/configurations where we know > it to be a problem. If it truly is a real issue, then we may want to solve this > with a more general mechanism of identifying wedged sockets (maybe an aggressive > timeout on reads immediately following a power-mode resumption, but not actively > aborting stuff). > > Other applications (including IIRC Firefox) don't do these sorts of hacks, and I > would like to better understand what benefits we get from them. > > Thoughts? If we don't need it, I'm all for getting rid of it. That having been said, I had thought that live sockets could be effectively blackholed on resume. I thought I looked back in the history and found a reference to that being the reason for this behavior, though I'm having trouble finding it now. Note that behavior here potentially depends on both OS and network hardware. It's entirely possible for a socket to be still alive when we come out of suspend mode, in theory. And it's entirely possible for a router to have effectively blackholed a socket by that point, rendering an OS incapable of determining if the socket is still around. Unfortunately, my Windows laptop BSODs every time (Except when it just reboots, skipping the BSOD entirely) when I try to resume from suspend, so may take some time to test this. I'm skeptical of how robust our GetAddrInfo logic is across suspend mode - its retry timers are on the order of seconds (I think?), and it's entirely possible we're on the last retries. 
Also, cancelling a DNS request can be hard, because of the request merging behavior, so it's possible that trying to reload a stuck page won't actually cancel lookups, requiring a user to stare at a broken page for the duration of the final timeout.

> https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
> File net/dns/host_resolver_impl.cc (right):
>
> https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
> net/dns/host_resolver_impl.cc:2472: AbortAllInProgressJobs();
> This is going to fail the requests with ERR_NETWORK_CHANGED, and *not*
> ERR_NETWORK_IO_SUSPENDED. Is that intentional?

You're right, that's a bug.
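A hedged sketch of the shape of the fix: let the abort path take the error code instead of hard-coding ERR_NETWORK_CHANGED, so the suspend path can report ERR_NETWORK_IO_SUSPENDED. The class below is a stand-in for HostResolverImpl's real job bookkeeping, not the actual code in host_resolver_impl.cc.

#include <map>
#include "net/base/completion_callback.h"
#include "net/base/net_errors.h"

// Stand-in for HostResolverImpl's in-progress job tracking; illustrative only.
class JobAborter {
 public:
  // Completes every outstanding request with |error| and forgets the jobs.
  void AbortAllInProgressJobs(int error) {
    for (auto& entry : jobs_)
      entry.second.Run(error);
    jobs_.clear();
  }

  // Network changes keep the existing error code...
  void OnIPAddressChanged() {
    AbortAllInProgressJobs(net::ERR_NETWORK_CHANGED);
  }

  // ...while the new suspend path reports the suspend-specific one.
  void OnSuspend() {
    AbortAllInProgressJobs(net::ERR_NETWORK_IO_SUSPENDED);
  }

  void AddJob(int job_id, const net::CompletionCallback& callback) {
    jobs_[job_id] = callback;
  }

 private:
  std::map<int, net::CompletionCallback> jobs_;
};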
Eric: Experimenting with the behavior on Windows, we get either a connection reset or connection closed, depending on whether there's pending data. Unfortunately, I think that means we still need some sort of failure here. That behavior does mean we don't have a connection hang, which is great, but in a world with HTTP/0.9 and "Connection: Close" responses, we don't want to incorrectly include a truncated resource in the cache. So given that, I think we want to go ahead with fixing up this CL and landing it?
FWIW, I recently learned that while Android devices *do* go to sleep, they do not deliver a signal to the application about this. As a result OnSuspend/OnRestore are never called on Android. You probably already knew this, but just as a heads up...
On 2016/11/16 15:22:19, Ryan Hamilton wrote: > FWIW, I recently learned that while Android devices *do* go to sleep, they do > not deliver a signal to the application about this. As a result > OnSuspend/OnRestore are never called on Android. You probably already knew this, > but just as a heads up... Thanks for the heads up! Yea, I did know that, but could very well not have. I only knew it because they used to be called when Chrome was backgrounded, causing problems, hence the switch to not calling them.
mmenke@chromium.org changed reviewers: - bnc@chromium.org, bnc@google.com, eroman@chromium.org, rch@chromium.org, rdsmith@chromium.org