Description
Move fail-on-suspend logic from URLRequestJob to TcpClientSocket.
The old code had a couple of issues:
1) It failed requests on entry to suspend mode, regardless of whether a
read was actually pending. This could confuse state machines that were
asynchronously processing partial results. Fixing this at the URLRequest
layer is made difficult due to the URLRequestJob's complicated state
machine. TcpClientSocket has a much simpler state machine.
2) It failed requests with ERR_ABORTED, which normally means the
URLRequest was cancelled by something above the network stack, so it
is not accurate in this instance. The RenderView also has magic
handling for this error code; it's unclear whether that logic is
desirable in this case.
If it is, the network stack should not be responsible for ensuring that
logic is triggered.
3) This is more minor, but it also resulted in failing requests that
aren't going over the network.
4) It only applied to URLRequests, and not other TCP sockets that may
be hung as a result of entering/exiting suspend mode.
This CL fixes 1-3, with 4 left as a potentially trivial follow-up.
In this CL the new logic is only enabled for requests that go through
a socket pool and for FTP sockets, with the HostResolver also updated
to fail requests on suspend. This should mostly catch the cases the
old code caught, with a fair number of exceptions (doesn't cover
QUIC, does cover WebSockets after a connection is established, does
cover a couple of classes that don't talk HTTP but still use socket
pools to get proxy support, fails HTTP/2 at the session layer, and
doesn't try to drain HTTP sockets of data for reuse on network
changes).
BUG=651120
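To make the description concrete, here is a minimal sketch of the direction described above: a TCP socket that registers as a power observer and fails only I/O that is actually pending with ERR_NETWORK_IO_SUSPENDED. Class and member names are illustrative and the PowerMonitor wiring is an assumption about the shape of the change, not the actual patch.

#include "base/callback_helpers.h"
#include "base/power_monitor/power_monitor.h"
#include "base/power_monitor/power_observer.h"
#include "net/base/completion_callback.h"
#include "net/base/io_buffer.h"
#include "net/base/net_errors.h"

// Illustrative only: a simplified socket that observes suspend and fails a
// pending read, roughly mirroring the approach in the description.
class SuspendAwareSocket : public base::PowerObserver {
 public:
  SuspendAwareSocket() { base::PowerMonitor::Get()->AddObserver(this); }
  ~SuspendAwareSocket() { base::PowerMonitor::Get()->RemoveObserver(this); }

  int Read(net::IOBuffer* buf,
           int buf_len,
           const net::CompletionCallback& callback) {
    if (suspended_)
      return net::ERR_NETWORK_IO_SUSPENDED;  // New reads fail immediately.
    // Real I/O elided; pretend the read is asynchronously pending.
    read_callback_ = callback;
    return net::ERR_IO_PENDING;
  }

  // base::PowerObserver implementation:
  void OnSuspend() override {
    suspended_ = true;
    // Only a read that is actually pending is failed, so consumers that are
    // busy processing an already-completed read are left alone.
    if (!read_callback_.is_null())
      base::ResetAndReturn(&read_callback_).Run(net::ERR_NETWORK_IO_SUSPENDED);
  }
  void OnResume() override { suspended_ = false; }

 private:
  bool suspended_ = false;
  net::CompletionCallback read_callback_;
};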
Patch Set 1
Patch Set 2: Fix NaCl
Patch Set 3: Really fix NaCl???
Patch Set 4: Fix callbacks
Patch Set 5: Skip test
Patch Set 6: Fix connect error case
Total comments: 2
Patch Set 7: Merge
Patch Set 8: Not rightly sure what this is...
Patch Set 9: Fix
Patch Set 10: .
Messages
Total messages: 57 (38 generated)
mmenke@chromium.org changed reviewers: + bnc@google.com, eroman@chromium.org, rch@chromium.org, rdsmith@chromium.org
Hey guys, I don't actually want reviews here (I need to add a bunch of tests, and do some manual testing before this is ready for review, and don't need 4 reviewers), but I wanted to run this idea in front of some more eyes to see if there are reasons why doing this is a really bad idea.

The summary: Currently, when entering suspend mode, we fail all URLRequestJobs with ERR_ABORTED. This is because when we resume, such sockets will often not return errors, and effectively be blackholed. How we do this has issues mentioned in the CL description. Moving the failure to the socket layer fixes these, and also gives us the option of applying this logic more widely, since it seems like this is more a TCP issue than a URLRequest issue.

This code also removes the behavior from QUIC itself (Which may be able to properly handle this case...Hopefully?).

Thoughts / comments / vociferous objections to this are all welcome.
On 2016/10/26 20:01:54, mmenke wrote:
> This code also removes the behavior from QUIC itself (Which may be able to
> properly handle this case...Hopefully?).

I'll think more about the non-QUIC ramifications, but in the meantime one thought about QUIC. The QUIC connection idle timeout is 30s (though we have a 5min experiment running). If the suspend period is longer than that then the server will already have closed the connection. Do you have any idea how long periods of suspend are typically?
On 2016/10/26 22:00:07, Ryan Hamilton wrote: > On 2016/10/26 20:01:54, mmenke wrote: > > This code also removes the behavior from QUIC itself (Which may be able to > > properly handle this case...Hopefully?). > > I'll think more about about the non-QUIC ramifications, but in the meantime one > thought about QUIC. The QUIC connection idle timeout is 30s (though we have a > 5min experiment running). If the suspend period is longer than that then the > server will already have closed the connection. Do you have any idea how long > periods of suspend are typically? I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds is certainly possible, if someone realizes they wanted to do something on their PC right after closing the lid.
On 2016/10/26 22:05:38, mmenke wrote: > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > On 2016/10/26 20:01:54, mmenke wrote: > > > This code also removes the behavior from QUIC itself (Which may be able to > > > properly handle this case...Hopefully?). > > > > I'll think more about about the non-QUIC ramifications, but in the meantime > one > > thought about QUIC. The QUIC connection idle timeout is 30s (though we have a > > 5min experiment running). If the suspend period is longer than that then the > > server will already have closed the connection. Do you have any idea how long > > periods of suspend are typically? > > I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds is > certainly possible, if someone realizes they wanted to do something on their PC > right after closing the lid. if that's the case, then there's probably not much value in keeping QUIC sessions alive. :/
On 2016/10/26 22:09:35, Ryan Hamilton wrote: > On 2016/10/26 22:05:38, mmenke wrote: > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > This code also removes the behavior from QUIC itself (Which may be able to > > > > properly handle this case...Hopefully?). > > > > > > I'll think more about about the non-QUIC ramifications, but in the meantime > > one > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we have > a > > > 5min experiment running). If the suspend period is longer than that then the > > > server will already have closed the connection. Do you have any idea how > long > > > periods of suspend are typically? > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds > is > > certainly possible, if someone realizes they wanted to do something on their > PC > > right after closing the lid. > > if that's the case, then there's probably not much value in keeping QUIC > sessions alive. :/ I'm happy to hook up QUIC sessions to the signal, too, if they need it (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it uses wouldn't trigger on resume).
On 2016/10/26 22:12:13, mmenke wrote: > On 2016/10/26 22:09:35, Ryan Hamilton wrote: > > On 2016/10/26 22:05:38, mmenke wrote: > > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > > This code also removes the behavior from QUIC itself (Which may be able > to > > > > > properly handle this case...Hopefully?). > > > > > > > > I'll think more about about the non-QUIC ramifications, but in the > meantime > > > one > > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we > have > > a > > > > 5min experiment running). If the suspend period is longer than that then > the > > > > server will already have closed the connection. Do you have any idea how > > long > > > > periods of suspend are typically? > > > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 seconds > > is > > > certainly possible, if someone realizes they wanted to do something on their > > PC > > > right after closing the lid. > > > > if that's the case, then there's probably not much value in keeping QUIC > > sessions alive. :/ > > I'm happy to hook up QUIC sessions to the signal, too, if they need it > (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it uses > wouldn't trigger on resume). Really?! Ugh. Quic's idle timeout detection (and all of QUIC's internal bookkeeping) is based on base::TimeTicks::Now(). If the computer goes to sleep, does that mean that TimeTicks don't move forward during that period? If so, that sounds excitingly problematic. Maybe I'm over thinking this.... let me try to think what ramifications this might have.
On 2016/10/26 22:16:09, Ryan Hamilton wrote: > On 2016/10/26 22:12:13, mmenke wrote: > > On 2016/10/26 22:09:35, Ryan Hamilton wrote: > > > On 2016/10/26 22:05:38, mmenke wrote: > > > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > > > This code also removes the behavior from QUIC itself (Which may be > able > > to > > > > > > properly handle this case...Hopefully?). > > > > > > > > > > I'll think more about about the non-QUIC ramifications, but in the > > meantime > > > > one > > > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we > > have > > > a > > > > > 5min experiment running). If the suspend period is longer than that then > > the > > > > > server will already have closed the connection. Do you have any idea how > > > long > > > > > periods of suspend are typically? > > > > > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 > seconds > > > is > > > > certainly possible, if someone realizes they wanted to do something on > their > > > PC > > > > right after closing the lid. > > > > > > if that's the case, then there's probably not much value in keeping QUIC > > > sessions alive. :/ > > > > I'm happy to hook up QUIC sessions to the signal, too, if they need it > > (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it > uses > > wouldn't trigger on resume). > > Really?! Ugh. Quic's idle timeout detection (and all of QUIC's internal > bookkeeping) is based on base::TimeTicks::Now(). If the computer goes to sleep, > does that mean that TimeTicks don't move forward during that period? If so, that > sounds excitingly problematic. Maybe I'm over thinking this.... let me try to > think what ramifications this might have. Hrm...Looks like I may be wrong about that.
On 2016/10/26 22:18:04, mmenke wrote: > On 2016/10/26 22:16:09, Ryan Hamilton wrote: > > On 2016/10/26 22:12:13, mmenke wrote: > > > On 2016/10/26 22:09:35, Ryan Hamilton wrote: > > > > On 2016/10/26 22:05:38, mmenke wrote: > > > > > On 2016/10/26 22:00:07, Ryan Hamilton wrote: > > > > > > On 2016/10/26 20:01:54, mmenke wrote: > > > > > > > This code also removes the behavior from QUIC itself (Which may be > > able > > > to > > > > > > > properly handle this case...Hopefully?). > > > > > > > > > > > > I'll think more about about the non-QUIC ramifications, but in the > > > meantime > > > > > one > > > > > > thought about QUIC. The QUIC connection idle timeout is 30s (though we > > > have > > > > a > > > > > > 5min experiment running). If the suspend period is longer than that > then > > > the > > > > > > server will already have closed the connection. Do you have any idea > how > > > > long > > > > > > periods of suspend are typically? > > > > > > > > > > I'd guess on the order of 10 minutes to hours or days. Less than 30 > > seconds > > > > is > > > > > certainly possible, if someone realizes they wanted to do something on > > their > > > > PC > > > > > right after closing the lid. > > > > > > > > if that's the case, then there's probably not much value in keeping QUIC > > > > sessions alive. :/ > > > > > > I'm happy to hook up QUIC sessions to the signal, too, if they need it > > > (PostDelayedTask looks at CPU ticks, not wall time, so I think any timer it > > uses > > > wouldn't trigger on resume). > > > > Really?! Ugh. Quic's idle timeout detection (and all of QUIC's internal > > bookkeeping) is based on base::TimeTicks::Now(). If the computer goes to > sleep, > > does that mean that TimeTicks don't move forward during that period? If so, > that > > sounds excitingly problematic. Maybe I'm over thinking this.... let me try to > > think what ramifications this might have. > > Hrm...Looks like I may be wrong about that. Actually, I'm not sure. Worth experimenting with, if you're strongly depending on one behavior there.
bnc@chromium.org changed reviewers: + bnc@chromium.org
https://codereview.chromium.org/2447083003/diff/100001/net/socket/stream_sock...
File net/socket/stream_socket.h (right):

https://codereview.chromium.org/2447083003/diff/100001/net/socket/stream_sock...
net/socket/stream_socket.h:118: // connected to remote systems. See https://crbug.com/4606).
Unmatched ')'.
So it won't surprise you to hear that I'm in favor of this CL; removing asynchronous, uncontrolled-by-the-code events in favor of funneling errors through well-tested error handling pathways SGTM.

I wouldn't expect functionality problems from the addition of putting failure in at the socket layer, since all clients need to be able to handle sockets being closed on them asynchronously anyway. So to me that leaves the following possible sources of problems:

* Things working that didn't before. I'm thinking specifically of URLRequests that are being serviced from cache surviving fine across a suspend. I don't see any reason why this would cause problems, but I raise it as an expected change in behavior.

* Things that are part of the URLRequestJob pathway that aren't part of the TCP socket failure pathway. QUIC's obvious here, and you and Ryan are figuring that out. And you've named several other possibilities. I'm not sure I'm going to be of help here, as this is about knowing corner cases of the code, and none come to me that you haven't mentioned.

* Things that are relying on the push notification of failure for good behavior, and will not behave as well on pull. The only thing that comes to mind here is "hanging gets". How does this change deal with outstanding async reads?

Two other thoughts:

* As I say above, I wouldn't expect problems from the socket change, only from the URLRequestJob change. While I'm sure I'm wrong :-}, that suggests the possibility of making the socket change global and seeing how hard it is to flush out the problems it produces. Do you expect problems from just the socket change?

* The above thought makes me wonder about splitting this CL in two, one for the socket change and one for the URLRequestJob change. This would require making sure that both of them being sensitive to suspends didn't have any bad synergistic effects.

FWIW.
On 2016/10/27 15:51:11, Randy Smith - Not in Mondays wrote: > So it won't surprise you to hear that I'm in favor of this CL; removing > asynchronous, uncontrolled by the code events in favor of funneling errors > through well-tested error handling pathways SGTM. > > I wouldn't expect functionality problems from the addition of putting failure in > at the socket layer, since all clients need to be able to handle sockets being > closed on them asynchronously anyway. So to me that leaves the following > possible sources of problems: > > * Things working that didn't before. I'm thinking specifically of URLRequests > that are being serviced from cache surviving fine across a suspend. I don't see > any reason why this would cause problems, but I raise it as an expected change > in behavior. > > * Things that are part of the URLRequestJob pathway that aren't part of the TCP > socket failure pathway. QUIC's obvious here, and you and Ryan are figuring that > out. And you've named several other possibilities. I'm not sure I'm going to > be of help here, as this is about knowing corner cases of the code, and none > come to me that you haven't mentioned. > > * Things that are relying on the push notification of failure for good behavior, > and will not behave as well on pull. The only thing that comes to mind here is > "hanging gets". How does this change deal with outstanding async reads? > > Two other thoughts: > * As I say above, I wouldn't expect problems from the socket change, only from > the URLRequestJob change. While I'm sure I'm wrong :-}, that suggests that > possibility of making the socket change global and seeing how hard it is to > flush out the problems it produces. Do you expect problems from just the socket > change? > > * The above thought makes me wondering about splitting this CL in two, one for > the socket change and one for the URLRequestJob change. This would require > making sure that both of them being sensitive to suspends didn't have any bad > synergistic effects. > > FWIW. Oh, I'll also add: Is there any way (probably in a different CL) to postpone the error until the kernel buffer has been drained of any data that arrived before the suspend? Maybe a read with O_NONBLOCK?
On 2016/10/27 16:04:23, Randy Smith - Not in Mondays wrote:
> On 2016/10/27 15:51:11, Randy Smith - Not in Mondays wrote:
> > So it won't surprise you to hear that I'm in favor of this CL; removing
> > asynchronous, uncontrolled by the code events in favor of funneling errors
> > through well-tested error handling pathways SGTM.
> >
> > I wouldn't expect functionality problems from the addition of putting failure
> > in at the socket layer, since all clients need to be able to handle sockets
> > being closed on them asynchronously anyway. So to me that leaves the following
> > possible sources of problems:
> >
> > * Things working that didn't before. I'm thinking specifically of URLRequests
> > that are being serviced from cache surviving fine across a suspend. I don't
> > see any reason why this would cause problems, but I raise it as an expected
> > change in behavior.
> >
> > * Things that are part of the URLRequestJob pathway that aren't part of the
> > TCP socket failure pathway. QUIC's obvious here, and you and Ryan are figuring
> > that out. And you've named several other possibilities. I'm not sure I'm going
> > to be of help here, as this is about knowing corner cases of the code, and
> > none come to me that you haven't mentioned.
> >
> > * Things that are relying on the push notification of failure for good
> > behavior, and will not behave as well on pull. The only thing that comes to
> > mind here is "hanging gets". How does this change deal with outstanding async
> > reads?

If there are outstanding reads/writes/connects, this will fail them (If the socket is destroyed/disconnected after it fails one, before it fails the other, it will just fail one of them). It will also fail all new reads/writes (But not connects). Once the socket is disconnected, or connected again without disconnecting, the socket will work as normal (Not sure if connecting again without disconnecting is expected to work - the API isn't clear on that).

> > Two other thoughts:
> > * As I say above, I wouldn't expect problems from the socket change, only from
> > the URLRequestJob change. While I'm sure I'm wrong :-}, that suggests that
> > possibility of making the socket change global and seeing how hard it is to
> > flush out the problems it produces. Do you expect problems from just the
> > socket change?

My main concern with the socket change is with connections to localhost. I'd like to apply it just to some sockets at first, see how things go, and then if that goes well apply it to all sockets a few releases later, possibly with a localhost exemption.

> > * The above thought makes me wondering about splitting this CL in two, one for
> > the socket change and one for the URLRequestJob change. This would require
> > making sure that both of them being sensitive to suspends didn't have any bad
> > synergistic effects.

This makes sense...I don't think all synergistic effects can be avoided (Just consider requests that, on failure, trigger other requests, that reach the DNS layer, and vice versa, though that kind of issue at the DNS layer seems unavoidable, even if we do it all at once). I think, at worst, we'd be failing extra requests that haven't yet actually reached the network layer when going into suspend mode, which I don't think is a huge deal.

> > FWIW.
>
> Oh, I'll also add: Is there any way (probably in a different CL) to postpone the
> error until the kernel buffer has been drained of any data that arrived before
> the suspend? Maybe a read with O_NONBLOCK?
Hrm...I could try delaying the error until an operation doesn't complete synchronously, but that sounds unlikely to work well on full duplex sockets. Suppose we can read data, but there's only a write pending. Do we wait for the consumer to read again? What if they don't? An arbitrary timeout?

I don't think I'm going to have something ready to land in the next 3 weeks (before branch), so in the meantime I'm thinking of maybe switching the URLRequestJob logic to return ERR_NETWORK_IO_SUSPENDED on entering suspend mode, to make sure at least that change is safe, and so we effectively land fewer changes at once.
And thanks for all the feedback, Randy!
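A test-shaped sketch of the contract described two messages up: a pending read fails on suspend, new reads fail while suspended, and the socket behaves normally once reconnected. TestPowerSource and ConnectToLocalTestServer are placeholder helpers for this sketch, not the tests in this patch.

// Illustrative test sketch; TestPowerSource stands in for however the test
// injects suspend/resume notifications, and socket setup is elided.
TEST(TcpClientSocketTest, PendingReadFailsOnSuspend) {
  TestPowerSource power_source;  // Hypothetical helper.

  // Connect a TCPClientSocket to a local test server (hypothetical helper).
  std::unique_ptr<TCPClientSocket> socket = ConnectToLocalTestServer();

  scoped_refptr<IOBuffer> buf(new IOBuffer(1024));
  TestCompletionCallback read_callback;
  ASSERT_EQ(ERR_IO_PENDING,
            socket->Read(buf.get(), 1024, read_callback.callback()));

  // Entering suspend mode fails the read that is actually pending...
  power_source.GenerateSuspendEvent();
  EXPECT_EQ(ERR_NETWORK_IO_SUSPENDED, read_callback.WaitForResult());

  // ...and new reads fail immediately until the socket is reconnected.
  EXPECT_EQ(ERR_NETWORK_IO_SUSPENDED,
            socket->Read(buf.get(), 1024, read_callback.callback()));
}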
> If the computer goes to sleep, does that mean that TimeTicks don't move forward during that period?

Unfortunately it depends on the platform -- we don't have a consistent policy in implementations of base::TimeTicks. You shouldn't assume this, but in practice:

* On Windows, TimeTicks continues ticking while the system is asleep (when using QueryPerformanceCounter).
* On Linux and Mac, TimeTicks stops ticking while the system is asleep (by virtue of using CLOCK_MONOTONIC on Linux, or mach_absolute_time() on Mac).

Linux has had support for CLOCK_BOOTTIME for a while, but we don't use it.

For more information on TimeTicks see https://goo.gl/uNvCYu
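As an aside, one quick way to check this empirically on a particular machine is to compare the wall-clock delta against the TimeTicks delta across a suspend. This is just a sketch using base::Time and base::TimeTicks; the helper names are made up.

#include "base/time/time.h"

// Snapshot both clocks before suspending the machine and again after resume,
// then compare the deltas.
struct ClockSnapshot {
  base::Time wall;
  base::TimeTicks ticks;
};

ClockSnapshot TakeSnapshot() {
  return {base::Time::Now(), base::TimeTicks::Now()};
}

// Returns how much wall-clock time was *not* counted by TimeTicks between the
// two snapshots. Roughly zero means TimeTicks kept ticking through the sleep
// (Windows/QPC behavior); roughly the sleep duration means the tick clock
// paused (CLOCK_MONOTONIC on Linux, mach_absolute_time() on Mac).
base::TimeDelta WallTimeNotCountedByTicks(const ClockSnapshot& before,
                                          const ClockSnapshot& after) {
  return (after.wall - before.wall) - (after.ticks - before.ticks);
}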
Thanks Matt, I am very supportive of moving this logic down to the socket layer!

A few high level comments before going into CL specifics:

(1) I am skeptical of the value of making HostResolverImpl a power observer. (Discounting the async resolver in the justification below):

* We can't properly cancel outstanding requests thanks to getaddrinfo() being blocking on a worker thread. So at best we are abandoning the attempt threads. As you point out, there may be a flurry of activity in cancelling and possibly re-starting along these change edges, so we may not even properly accomplish that.

* We already have a baked in mechanism to re-try getaddrinfo() when things are taking a long time.

Given this, I would expect doing nothing is better. If the getaddrinfo() runs to successful completion (yay). If it fails quickly after resume, that is fine too (no worse than having aborted and failed). If it gets wedged and takes a while to timeout after resume, then the HostResolverImpl should kick off another attempt thread and win the race.

(2) Should we even be doing this by default?

We have a long history behind this unsavory hack, which goes back to working around a blue screen of death that was happening on Windows (long since fixed).

Fundamentally it feels like brokenness in the host network implementation that we are working around, which is why I wholeheartedly agree with putting it into a layer close to that in our code, and trying to better contain the ugliness.

I would love if we can better justify the concrete instances where we benefit from this. There are definite downsides to doing this, which leads to a frustrating user experience (same thing goes for ERR_NETWORK_CHANGED), and I am not sure that we are making the right tradeoff.

Ideally we would only enable this for OS versions/configurations where we know it to be a problem. If it truly is a real issue, then we may want to solve this with a more general mechanism of identifying wedged sockets (maybe an aggressive timeout on reads immediately following a power-mode resumption, but not actively aborting stuff).

Other applications (including IIRC Firefox) don't do these sorts of hacks, and I would like to better understand what benefits we get from them.

Thoughts?

https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
File net/dns/host_resolver_impl.cc (right):

https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
net/dns/host_resolver_impl.cc:2472: AbortAllInProgressJobs();
This is going to fail the requests with ERR_NETWORK_CHANGED, and *not* ERR_NETWORK_IO_SUSPENDED. Is that intentional?
On 2016/10/27 19:26:26, eroman wrote: > Thanks Matt, I am very supportive of moving this logic down to the socket layer! > > A few high level comments before going into CL specifics: > > (1) I am skeptical of the value of making HostResolverImpl a power observer. > (Discounting the async resolver in the justification below): > > * We can't properly cancel outstanding requests thanks to getaddrinfo() being > blocking on a worker thread. So at best we are abandoning the attempt threads. > As you point out, there may be a flurry of activity in cancelling and possibly > re-starting along these change edges, so we may not even properly accomplish > that. > > * We already have a baked in mechanism to re-try getaddrinfo() when things are > taking a long time. > > Given this, I would expect doing nothing is better. If the getaddrinfo() runs to > successful completion (yay). If it fails quickly after resume, that is fine too > (no worse than having aborted and failed). If it gets wedged and takes a while > to timeout after resume, then the hostresolverimpl should kick off another > attempt thread and win the race. > > (2) Should we even be doing this by default? > > We have a long history behind this unsavory hack, which goes back to working > around a blue screen of death that was happening on windows (long since fixed). > > Fundamentally it feels like brokenness in the host network implementation that > we are working around, which is why I wholeheartedly agree with putting it into > a layer close to that in our code, and trying to better contain the ugliness. > > I would love if we can better justify the concrete instances where we benefit > from this. There are definite downsides to doing this, which leads to a > frustrating user experience (same thing goes for ERR_NETWORK_CHANGED), which I > am not sure that we are making the right tradeoff on. > > Ideally we would only enable this for OS versions/configurations where we know > it to be a problem. If it truly is a real issue, then we may want to solve this > with a more general mechanism of identifying wedged sockets (maybe an aggressive > timeout on reads immediately following a power-mode resumption, but not actively > aborting stuff). > > Other applications (including IIRC Firefox) don't do these sorts of hacks, and I > would like to better understand what benefits we get from them. > > Thoughts? If we don't need it, I'm all for getting rid of it. That having been said, I had thought that live sockets could be effectively blackholed on resume. I thought I looked back in the history and found a reference to that being the reason for this behavior, though I'm having trouble finding it now. Note that behavior here potentially depends on both OS and network hardware. It's entirely possible for a socket to be still alive when we come out of suspend mode, in theory. And it's entirely possible for a router to have effectively blackholed a socket by that point, rendering an OS incapable of determining if the socket is still around. Unfortunately, my Windows laptop BSODs every time (Except when it just reboots, skipping the BSOD entirely) when I try to resume from suspend, so may take some time to test this. I'm skeptical of how robust our GetAddrInfo logic is across suspend mode - its retry timers are on the order of seconds (I think?), and it's entirely possible we're on the last retries. 
Also, cancelling a DNS request can be hard, because of the request merging behavior, so it's possible that trying to reload a stuck page won't actually cancel lookups, requiring a user to stare at a broken page for the duration of the final timeout.

> https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
> File net/dns/host_resolver_impl.cc (right):
>
> https://codereview.chromium.org/2447083003/diff/100001/net/dns/host_resolver_...
> net/dns/host_resolver_impl.cc:2472: AbortAllInProgressJobs();
> This is going to fail the requests with ERR_NETWORK_CHANGED, and *not*
> ERR_NETWORK_IO_SUSPENDED. Is that intentional?

You're right, that's a bug.
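A hedged sketch of the shape of the fix: let the abort path take the error code instead of hard-coding ERR_NETWORK_CHANGED, so the suspend path can report ERR_NETWORK_IO_SUSPENDED. The class below is a stand-in for HostResolverImpl's real job bookkeeping, not the actual code in host_resolver_impl.cc.

#include <map>
#include "net/base/completion_callback.h"
#include "net/base/net_errors.h"

// Stand-in for HostResolverImpl's in-progress job tracking; illustrative only.
class JobAborter {
 public:
  // Completes every outstanding request with |error| and forgets the jobs.
  void AbortAllInProgressJobs(int error) {
    for (auto& entry : jobs_)
      entry.second.Run(error);
    jobs_.clear();
  }

  // Network changes keep the existing error code...
  void OnIPAddressChanged() {
    AbortAllInProgressJobs(net::ERR_NETWORK_CHANGED);
  }

  // ...while the new suspend path reports the suspend-specific one.
  void OnSuspend() {
    AbortAllInProgressJobs(net::ERR_NETWORK_IO_SUSPENDED);
  }

  void AddJob(int job_id, const net::CompletionCallback& callback) {
    jobs_[job_id] = callback;
  }

 private:
  std::map<int, net::CompletionCallback> jobs_;
};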
Eric: Experimenting with the behavior on Windows, we get either a connection reset or connection closed, depending on whether there's pending data. Unfortunately, I think that means we still need some sort of failure here. That behavior does mean we don't have a connection hang, which is great, but in a world with HTTP/0.9 and "Connection: Close" responses, we don't want to incorrectly include a truncated resource in the cache. So given that, I think we want to go ahead with fixing up this CL and landing it?
FWIW, I recently learned that while Android devices *do* go to sleep, they do not deliver a signal to the application about this. As a result OnSuspend/OnRestore are never called on Android. You probably already knew this, but just as a heads up...
On 2016/11/16 15:22:19, Ryan Hamilton wrote: > FWIW, I recently learned that while Android devices *do* go to sleep, they do > not deliver a signal to the application about this. As a result > OnSuspend/OnRestore are never called on Android. You probably already knew this, > but just as a heads up... Thanks for the heads up! Yea, I did know that, but could very well not have. I only knew it because they used to be called when Chrome was backgrounded, causing problems, hence the switch to not calling them.
mmenke@chromium.org changed reviewers: - bnc@chromium.org, bnc@google.com, eroman@chromium.org, rch@chromium.org, rdsmith@chromium.org