PvE lag topic

c45y · September 28, 2015

So, as you all know PvE is faced with a bit of a lag problem, currently we have tried a lot of things but I am personally running out of ideas. Now you are all smart people, and full of great ideas, and we are very much open to any suggestion you may have.

Below is a braindump of the current state, and will be updated as we progress. Thank you very much for your understanding as we work through this, and any assistance you can provide.

Problem: Shits laggy yo, currently unknown trigger or steps to reproduce, appears to be related to people joining the server in some instances but not always.

State:

Ram: 16GB

Build: 1.8.7 spigot

Plugins: (44): AdminHunt, AsyncWorldEdit, AsyncWorldEditInjector, BookExploitFix, bPermissions, ClientPermissions, ClockBlocked, CobraCorral, CommandBlock, CommandHelper, Doppelganger, dynmap, KeepBabyMobs, KitchenSink, LogBlock, LolNeigh, LolNo, LWC, MaskOfFutures, MobLimiter, ModMode, ModReq, Multiverse-Core, NerdFlags, NerdList, NerdMessage, NerdSpawn, NoCheatPlus, OpenInv, PacketWrapper, PlumpXP, ProperTime, ProtocolLib, RedstoneClockDetector, SafeBuckets, SBC, StandMaster9000, SureLog, TPControl, VanishNoPacket, WGCustomFlags, WorldBorder, WorldEdit, WorldGuard

Versions: Changed between this rev and last are underlined

Spigot 1.8 git-Spigot-550ebac-7019900 -> 1.8.7 git-Spigot-f928e7a-e91aed8
bPermissions v2.10.9m -> v2.12
zombieKill v1.0.4 -> #
SBC v0.2.1 -> v1.2.2
AsyncWorldEditInjector v2.1.3 -> v2.1.3
SureLog v0.1 -> v0.1
WorldEdit v6.0.2-SNAPSHOT;no_git_id -> v6.1.1-SNAPSHOT;3394-9a65bdb7
MobLimiter v1.1.3 -> v1.1.3
KeepBabyMobs v0.1.2 -> v0.1.2
StandMaster9000 v0.0.3 -> v0.0.3
Multiverse-Core v2.5-b691 -> v2.5-b697
LogBlock v1.94-dev-SNAPSHOT (build #256) -> v1.94-dev-SNAPSHOT (build #260)
LWC v4.5.0-SNAPSHOT -> v4.5.0-SNAPSHOT
VillageMarker v0.33 -> #
WorldGuard v6.0.0-SNAPSHOT.1583- -> v6.1.1-SNAPSHOT.1628-e28b236
CommandBlock v2.0 -> v2.0
CobraCorral v1.2 -> v1.2.2
dynmap v2.1-SNAPSHOT-1091 -> v2.1-SNAPSHOT-1091
NerdList v0.1 -> v0.2
ModReq v1.6 -> v1.6
NerdSpawn v2.0 -> v2.0
AdminHunt v1.0-1.6.4 -> v1.0-1.6.4
LolNeigh v1.0.0 -> v1.0.0
LolNo v0.4 -> v0.4
Doppelganger v0.9.2 -> v0.9.2
BookExploitFix v0.5 -> v0.6
ClientPermissions v1.1.0 -> v1.1.0
NerdMessage v0.5 -> v0.5
SignKitten v0.1 -> #
OpenInv v2.2.8 -> v2.3.4
SafeBuckets v0.7.1 -> v0.7.2
TPControl v0.3 -> v0.4.2
ProtocolLib v3.6.3-SNAPSHOT -> v3.6.4
WGCustomFlags v1.8.1-beta -> v1.8.1-beta
WorldBorder v1.7.9 -> v1.7.9
NerdFlags v0.1.1 -> v0.1.4
AsyncWorldEdit v2.1.3 -> v2.1.5
NoCheatPlus v3.12.0-SNAPSHOT-sMD5NET-b793 -> v3.12.1-SNAPSHOT-sMD5NET-b849
CommandHelper v3.3.1-SNAPSHOT.2833- -> v3.3.1-SNAPSHOT.2910-
VanishNoPacket v3.19.1 -> v3.19.1
ModMode v3.5.0 -> v3.5.0
KitchenSink v0.9.2 -> v0.9.4
NerdGreeting v2.0.0 -> #
MaskOfFutures v0.10.3.1 -> v0.10.3.1
PlumpXP # -> v0.4
ProperTime # -> v2.2
RedstoneClockDetector # -> v0.2.7-b12
ClockBlocked # -> v1.0.0

Things tested:

Running with NoCheatPlus disabled
Running with CommandHelper disabled
Running with PlumpXP disabled
Running with MobLimiter disabled
Migrated LWC from sqlite to mysql
Reverted bPermissions version to last stable, lag got worse
Reviewed redstone activity with RedstoneClockDetector, appears normal
Reviewed spigot timings reports @todo generate and attach during peak times
Bypassed BungeeCord connecting to PvE directly, still lag spikes

Observations:

Lag seems related to login events ( by extension chunk loads )
reverting bPermissions from v2.12 to v2.10 increased lag noticeably, needs a review
Teleporting to unloaded chunks sometimes triggers a lag spike
Issue could not be triggered using two accounts on an alternate server, seems to be tied to in game actions/number of people online
Opening chests appears to trigger lag occasionally, though not consistently

Edited September 29, 2015 by c45y
include spigot build info

totemo · September 29, 2015

The lag spikes appear to be a stop-the-world type phenomenon. By bumping up the RAM usage we eliminated a big garbage collection pause on P as a possible cause.

As I said at the time, I don't think the individual bypassing of BungeeCord was a complete test, since BungeeCord is a shared resource for all players. If BungeeCord can in some way hold up or disrupt the main game thread then other players on Bungee can still cause issues.

Today, I see an admincraft post about this:

* https://www.reddit.com/r/admincraft/comments/3miu6u/bungeecord_memory_usage_skyrocketing/

* https://www.spigotmc.org/threads/running-bungeecord-with-2gb-of-memory-or-less-with-g1gc.91807/#post-1017207

Indicating that BungeeCord needs a lot of RAM in recent versions and also needs particular GC flags to the JVM. I found out when I sat in on that testing that we're on default args for the BungeeCord JVM. I'd say that's an avenue of investigation. It's possible that there are multiple issues and that BungeeCord is an amplifier of any lag that does come up, as indicated in the post. It would be consistent with the occasional lag around chunk loading when teleporting too.

One counter to theories along this line was that neither C nor S (when it existed) was seeing these issues.

I think the main lag situations appear to be on logins and saves. As I said last night (my timezone) I do wonder about the name API but I haven't been able to specifically trigger a noticeable latency issue with it. It wouldn't explain the chunk loading anyway.

totemo · September 29, 2015

Get rid of SureLog. It was just experimental stuff and absolultely doesn't need to be there. I doubt (hope) it's not having any effect. :)

totemo · September 29, 2015

Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that. Curious about Multiverse.

And further questions:

* NerdList is in the changed list. Can we get a source code link?

tompreuss · September 29, 2015

Versions:

For folks who might know what to look for and/or just want to browse the actual changes to the listed plugins, I've added relevant links to code diffs where possible:

Spigot 1.8 -> 1.8.7
- Do you have build numbers? "git-Spigot-07a5b1b-04afd79 (MC: 1.8.7)" or somesuch would allow for pin pointing commits.
  - Spigot 1.8 git-Spigot-550ebac-7019900 -> 1.8.7 git-Spigot-f928e7a-e91aed8
    - Atlassian doesn't allow for commit diff links as far as I can tell, so the best single link I can supply is to the 550ebac to current HEAD changes, so just only look at the diffs from the bottom of the table up to the f928e7aa73b 21-Jun-2015 commit.
bPermissions v2.10.9m -> v2.12
- v2.12 is still beta, github doesn't have latest beta commits
SBC v0.2.1 -> v1.2.2
- github diff
WorldEdit v6.0.2-SNAPSHOT;no_git_id -> v6.1.1-SNAPSHOT;3394-9a65bdb7
- github diff
Multiverse-Core v2.5-b691 -> v2.5-b697
- github diff
LogBlock v1.94-dev-SNAPSHOT (build #256) -> v1.94-dev-SNAPSHOT (build #260)
- github diff
WorldGuard v6.0.0-SNAPSHOT.1583- -> v6.1.1-SNAPSHOT.1628-e28b236
- github diff
CobraCorral v1.2 -> v1.2.2
- github diff
NerdList v0.1 -> v0.2
- Not open source?
BookExploitFix v0.5 -> v0.6
- Not open source, has only changelog.
OpenInv v2.2.8 -> v2.3.4
- github diff
SafeBuckets v0.7.1 -> v0.7.2
- github diff
TPControl v0.3 -> v0.4.2
- github diff
ProtocolLib v3.6.3-SNAPSHOT -> v3.6.4
- github diff (best guess based on dates and version numbers supplied)
NerdFlags v0.1.1 -> v0.1.4
- github diff
AsyncWorldEdit v2.1.3 -> v2.1.5
- github diff
NoCheatPlus v3.12.0-SNAPSHOT-sMD5NET-b793 -> v3.12.1-SNAPSHOT-sMD5NET-b849
- github diff
CommandHelper v3.3.1-SNAPSHOT.2833- -> v3.3.1-SNAPSHOT.2910-
- github diff
KitchenSink v0.9.2 -> v0.9.4
- github diff
PlumpXP # -> v0.4
- New, but following changes made since use on S: github diff
ProperTime # -> v2.2
- New to P, no changes since use on S, github
RedstoneClockDetector # -> v0.2.7-b12
- New, github
ClockBlocked # -> v1.0.0
- New, github

----

(edits: fixing links)

(edit to add link to spigot changes due to edit in parent)

Edited September 29, 2015 by tompreuss

c45y · September 29, 2015

Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that. Curious about Multiverse.

And further questions:

* NerdList is in the changed list. Can we get a source code link?

Up to the original author, I'll let them decide ( have pinged on irc )

--

Have update to include spigot build info

--

Surelog is removed, no idea what it was anyway

--

While I am surprised it could be bungee after experiencing the same issues while not connected through it it does look like there is some intense heap clearing going on, I'll have a read of those threads and make some changes once im finished here at work

-- Edit

BungeeCord is now running with the specified args

Edited September 29, 2015 by c45y

c45y · September 29, 2015

Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that. Curious about Multiverse.

And further questions:

* NerdList is in the changed list. Can we get a source code link?

From the original topic,

Running with NoCheatPlus disabled
Running with CommandHelper disabled
Running with PlumpXP disabled
Running with MobLimiter disabled

multiverse has not been disabled in running, though in theory we could do that live and just drop the admin test world, will test later tonight

totemo · September 29, 2015

It seemed like the changed JVM flags for BungeeCord had a pronounced negative effect on latency. There was a considerable increase in in the amount of NCP spam about "too many packets". Although the effect was detrimental, it's interesting that there was an effect. It will be interesting to see what the heap is doing.

One interesting fact that's apparent from the graph you posted: we're churning through about 80MB of young generation heap every minute. And then after 45 minutes or so, we run out of headroom and the GC clears up the old generation.

I'm not sure if those are the old GC settings or the new. But I do wonder, how long is that young generation GC taking, in particular: in comparison to a tick (50ms). According to this article G1 will give us lower latency collections than CMS which is what we would want, I guess.

Looking through Oracle's docs, I notice that the default goal for a GC pause with G1 is 200ms. That's 4 ticks. Could we reduce that somewhat drastically: -XX:MaxGCPauseMillis=25? The third heading at that page, "The G1 Garbage Collector", goes into detail on how G1 operates and differs from CMS. According to this, we should be able to achieve 10ms collections for BungeeCord. We might need to run with a fair bit more headroom though.

The G1 command line options are summarised rather well here, bottom of the page.

c45y · September 29, 2015

BungeeCord has been modified to have a 1G heap, as well as using maxgcpausemillis set to 50, visualvm heap graphs will be taken during higher usage periods tomorrow to compare.

Currently lag appears to be triggered by Join/Quit and sometimes chest access.

roastnewt · September 29, 2015

Are you running Java 7 or 8? They made some changes re: GC as discussed above.

c45y · September 29, 2015

It appears java 7 is our environment for the servers.

PvE and BungeeCord have been updated to use java8, pve seems to have a nicer heap graph, while bungee still has quite a spikey map

Edited September 30, 2015 by c45y

redwall_hp · September 30, 2015

I just lowered -XX:G1HeapRegionSize to 4M. C45Y's still running the graph, but initial report sounds like it's smoother.

redwall_hp · September 30, 2015

Here are Spigot timings with around 50 players online: http://timings.aikar.co/?url=12622106

Note that the lag spikes occur even when there are more like 10 players on, when most of the flagged values are reasonably low. (Will post a timing from off-peek hours later.)

Edit: And here's c45y's latest graph of the BungeeCord GC.

totemo · September 30, 2015

You are using DynMap, and its rendering is causing you a decent amoung of lag due to it loading chunks.

I'm not sure what triggers that particular message or where they learned to spell "amount" but can we run for a while without DynMap to see what effect that has.

That latest GC graph for BungeeCord shows memory usage spikes of about 175 MB, e.g. at about 12:08, 12:18, 12:23, 12:28. Do they correlate with saves? Where do the lag spikes happen on that time scale? What would be causing bungee to suddenly buffer so much stuff? It seems to be instantaneous, rather than following the preceding slope of the graph.

Makes me wonder about TCP receive buffer tuning: http://www.cyberciti.biz/faq/linux-tcp-tuning/. I'm not sure that the graph really points to that. Just throwing it in there in case it triggers any thoughts.

tompreuss · September 30, 2015

I'm not sure what triggers that particular message or where they learned to spell "amount"

From aikar/timings/legacy/index.php:

if ($event == "Task: Unknown(Single)" && substr($plugin, 0, 6) == "dynmap" && $pct_tot_raw > 0.005) {
    $recommendations[] = "<b>You are using DynMap, and its rendering is causing you a decent amoung of lag due to it loading chunks.</b>";
}

redwall_hp · September 30, 2015

I'm not sure what triggers that particular message or where they learned to spell "amount" but can we run for a while without DynMap to see what effect that has.

We ran it with dynmap disabled for a few days before the livemap was made public. It didn't make a difference to the lag spikes.

redwall_hp · September 30, 2015

This timing is from a few nights ago, during off-peak hours: http://timings.aikar.co/?url=12601271

12 players online, still lag spikes.

totemo · September 30, 2015

Next thing to try, if relevant: -XX:+PerfDisableSharedMem

From TL;DR: http://www.evanjones.ca/jvm-mmap-pause.html

redwall_hp · September 30, 2015

Since there have been observations that spikes may correlate with players joining, more specifically joining on unloaded chunks, here's a list of every plugin that registers an on-join listener. (As reported by a plugin with an on-join handler with MONITOR priority.)

ProtocolLib, bPermissions, SBC, CommandHelper, OpenInv, NoCheatPlus, NoCheatPlus, NoCheatPlus, NoCheatPlus, VanishNoPacket, Multiverse-Core, WorldGuard, WorldGuard, CobraCorral, ModReq, NerdSpawn, CommandHelper, AsyncWorldEdit, ModMode, VanishNoPacket, LolNo, LWC, ProtocolLib, LogBlock, NerdList, dynmap, dynmap, notifyshit, KitchenSink

And the quit listeners:

CommandHelper, ModMode, KitchenSink, Multiverse-Core, CobraCorral, CommandHelper, CommandHelper, TPControl, AsyncWorldEdit, VanishNoPacket, LolNo, ClientPermissions, LWC, ProtocolLib, StandMaster9000, LogBlock, dynmap, notifyshit, OpenInv, NoCheatPlus, NoCheatPlus

c45y · September 30, 2015

Bungee has got some new flags that actually make sense for our environment

PvE also got some new flags, which results in a much nicer heap

These were taken at relatively low usage periods, with only 13 people on P and 1 on C, will be taken again at high usage times to compare.

We are now confident the issue is not with our GC, but with a plugin as originally investigated. Work will continue on tracking it down and figuring out why it does not show up in timing reports

c45y · October 1, 2015

Issue confirmed as CobraCorral handling of ChunkUnloadEvent, ~~specific cause not found~~, we are reviewing the source now and reaching out to the plugin author

With more debugging we have narrowed the issue to what appears to be async database updates. However for whatever reason each locked horse per chunk unload takes 1000ms to complete the following function

if(utils.isHorseLocked((Horse)entity)) {
    LockedHorse lhorse = config.Database.getHorse(((Horse)entity).getUniqueId());
    utils.updateHorse(lhorse, (Horse)entity);
}

https://gist.github.com/stevommmm/0fa7948c2fabb60fb904

At this stage my theory is that https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/Database.java#L97 is a blocking operation, however I can't seem to make sense of the javadocs at 1am

Debug build has confirmed that the call to getHorses is the issue

https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/CorralListener.java#L586

https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/Database.java#L196

For some reason horseCache is not being populated, so every time we ask Database for a horse it is querying the database on the main thread ( thus lag )

Edited October 2, 2015 by c45y

tompreuss · October 1, 2015

Issue confirmed as CobraCorral handling of ChunkUnloadEvent, ~~specific cause not found~~, we are reviewing the source now and reaching out to the plugin author

With more debugging we have narrowed the issue to what appears to be async database updates. However for whatever reason each locked horse per chunk unload takes 1000ms to complete the following function
if(utils.isHorseLocked((Horse)entity)) {
    LockedHorse lhorse = config.Database.getHorse(((Horse)entity).getUniqueId());
    utils.updateHorse(lhorse, (Horse)entity);
}
https://gist.github.com/stevommmm/0fa7948c2fabb60fb904

Note for casual readers that what is seen in that log paste is not "each locked horse per chunk unload" on the server taking more than 1000ms, it's only the locked horses that are taking more than 1000ms that are shown:

        Long duration = System.currentTimeMillis() - begin;
        if (duration > 1) {
            this.plugin.getLogger().log(Level.INFO, "ChunkUnloadEvent; duration={0}", duration);
        }

All the horses that aren't taking longer than 1000ms wouldn't be shown.

CobraCorral v1.2 -> v1.2.2
github diff

I don't see any changes to onChunkUnload() here; additionally this function has been there since March of 2014, why hasn't it manifested this problem until now at the new rev start?

totemo · October 1, 2015

Just to deepen the mystery even further, C45Y:

CobraCorral v1.2.2 Total: 0.006 s	Pct: 0.01%
   Pct Total	Pct Tick	   Total	      Avg	 PerTick	     Count		Event
       0.01%	   0.01%	  0.00 s	     0.01 ms	 0.5	       0.6k	    CorralListener::onChunkUnload(ChunkUnloadEvent)
       0.00%	   2.52%	  0.00 s	     1.26 ms	 0.0	       0.0k	    CorralListener::onPlayerJoin(PlayerJoinEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.2	       0.3k	    CorralListener::onEntityDamage(EntityDamageEvent)
       0.00%	   0.02%	  0.00 s	     0.01 ms	 0.0	       0.0k	    CorralListener::onInventoryClose(InventoryCloseEvent)
       0.00%	   0.42%	  0.00 s	     0.21 ms	 0.0	       0.0k	    CorralListener::onPlayerInteractEntity(PlayerInteractEntityEvent)
       0.00%	   0.01%	  0.00 s	     0.00 ms	 0.0	       0.1k	    CorralListener::onInventoryClick(InventoryClickEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onEntityDamageByEntity(EntityDamageByEntityEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onInventoryOpen(InventoryOpenEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onEntityDeath(EntityDeathEvent)
       0.00%	   0.01%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onInventoryDrag(InventoryDragEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onVehicleExit(VehicleExitEvent)

onChunkUnload() was called about 600 times, averaging 0.01ms per call in one of the more recent timings that you guys linked: http://timings.aikar.co/?url=12629871. You were saying that you saw an 11,000ms chunk unload event. It's really strange that this doesn't show up on the timings. 11000ms/600 = 18.3 ms. A single bad event should have a noticeable effect on the average.

redwall_hp · October 1, 2015

onChunkUnload() was called about 600 times, averaging 0.01ms per call in one of the more recent timings that you guys linked: http://timings.aikar.co/?url=12629871. You were saying that you saw an 11,000ms chunk unload event. It's really strange that this doesn't show up on the timings. 11000ms/600 = 18.3 ms. A single bad event should have a noticeable effect on the average.

One would assume. Then there's this timing: http://timings.aikar.co/?url=12629871

High load, lots of red events for entities and mob AI. But the average tps during the sample is still over 19. Usually when that happens, TPS drops significantly.

totemo · October 2, 2015

That's the same timing Redwall. I assume you meant to link a different one. Can we see the one that shows the problem?

I'm confused about the TPS drops. Last I heard, TPS was weirdly rock solid:

PvE lag topic

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members