/docs/MyDocs : revision 16

To get this branch, use:

bzr branch
http://darksoft.org/webbzr/docs/MyDocs

« back to all changes in this revision

Viewing changes to Analysis/practice/trace-readahead-with-database-io.html

Committer: Suren A. Chilingaryan
Date: 2015-08-21 03:52:00 UTC
Revision ID: csa@suren.me-20150821035200-xu1zh22cqlk2omcq

Profiling

files added:
Administration/Linux/system/storage/softraid/chunk_size.txt

Administration/Platforms/deb

Administration/Platforms/deb/key-management.txt

Administration/Platforms/portage

Administration/Platforms/portage/automake.txt

Administration/Platforms/portage/bugs.txt

Administration/Platforms/portage/catalyst.txt

Administration/Platforms/portage/crossdev.txt

Administration/Platforms/portage/gentoo.txt

Administration/Platforms/portage/packages.txt

Administration/Platforms/suse/problems.txt

Analysis

Analysis/Linux-analysis-tools-Im-Programmer.jpg

Analysis/accounting

Analysis/accounting/network.txt

Analysis/accounting/pstop.txt

Analysis/accounting/tiptop.txt

Analysis/accounting/traffic.mht

Analysis/analysis

Analysis/analysis/Runtime.errors.pdf

Analysis/analysis/rational.txt

Analysis/benchmark

Analysis/benchmark/caching.txt

Analysis/benchmark/cuda.txt

Analysis/benchmark/matlab.txt

Analysis/choosing-a-linux-tracer.html

Analysis/debugger

Analysis/debugger/gdb

Analysis/debugger/gdb/automation.txt

Analysis/debugger/gdb/expecting.txt

Analysis/debugger/gdb/flow.txt

Analysis/debugger/gdb/gdb.txt

Analysis/debugger/gdb/remote.txt

Analysis/debugger/gdb/signals.txt

Analysis/debugger/gdb/threads.txt

Analysis/debugger/gdb/ui.txt

Analysis/debugger/info.txt

Analysis/debugger/kernel

Analysis/debugger/kernel/using-gdb-for-debugging-kernel-modules.html

Analysis/debugger/malloc

Analysis/debugger/malloc/google.txt

Analysis/debugger/malloc/valgrind.txt

Analysis/debugger/totalview.txt

Analysis/gpu.txt

Analysis/instrumentation

Analysis/instrumentation/dtrace

Analysis/instrumentation/dtrace/dtrace.txt

Analysis/instrumentation/dtrace/linux_zones

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux.html

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux_files

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux_files/feed-12x.gif

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux_files/ga.js

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux_files/metrics_group1.js

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux_files/pacifica-custom.css

Analysis/instrumentation/dtrace/linux_zones/dtrace_for_linux_files/s_code_remote.js

Analysis/instrumentation/ebpf

Analysis/instrumentation/ebpf/ebpf.txt

Analysis/instrumentation/ebpf/samples.tar.bz2

Analysis/instrumentation/pin.txt

Analysis/instrumentation/systap

Analysis/instrumentation/systap/examples

Analysis/instrumentation/systap/examples/io.stp

Analysis/instrumentation/systap/examples/pcitool.stp

Analysis/instrumentation/systap/examples/socket_trace.stp

Analysis/instrumentation/systap/modules.txt

Analysis/instrumentation/systap/probes.txt

Analysis/list.txt

Analysis/practice

Analysis/practice/trace-readahead-with-database-io.html

Analysis/profiling

Analysis/profiling/gcc

Analysis/profiling/gcc/gcov.txt

Analysis/profiling/gcc/gnu.txt

Analysis/profiling/gcc/gprof.txt

Analysis/profiling/langs

Analysis/profiling/langs/python.txt

Analysis/profiling/oprofile.txt

Analysis/profiling/outdated

Analysis/profiling/outdated/_list_.txt

Analysis/profiling/outdated/codeanalyst.txt

Analysis/profiling/outdated/perfmon2.txt

Analysis/profiling/outdated/perfsuite.txt

Analysis/profiling/perf-tools.txt

Analysis/profiling/perf.html

Analysis/profiling/perf.txt

Analysis/profiling/techniques

Analysis/profiling/techniques/cache-misses-intel.html

Analysis/profiling/vtune.txt

Analysis/tools

Analysis/tools/segments.txt

Analysis/tracers

Analysis/tracers/blktrace.txt

Analysis/tracers/debugfs.txt

Analysis/tracers/net

Analysis/tracers/net/ngrep.txt

Development/languages/bash/sed.txt

Development/peformance

Development/peformance/memory.txt

files removed:
Administration/Linux/packages

Administration/Linux/packages/deb

Administration/Linux/packages/deb/key-management.txt

Administration/Linux/packages/portage

Administration/Linux/packages/portage/automake.txt

Administration/Linux/packages/portage/bugs.txt

Administration/Linux/packages/portage/catalyst.txt

Administration/Linux/packages/portage/crossdev.txt

Administration/Linux/packages/portage/gentoo.txt

Administration/Linux/packages/portage/packages.txt

Administration/Linux/system/accounting

Administration/Linux/system/accounting/bios.txt

Administration/Linux/system/accounting/network

Administration/Linux/system/accounting/network/info.txt

Administration/Linux/system/accounting/network/ngrep.txt

Administration/Linux/system/accounting/network/traffic.mht

Administration/Linux/system/accounting/network/traffic.txt

Administration/Linux/system/accounting/ps.txt

Administration/Linux/system/accounting/ulimit.txt

Development/debugging

Development/debugging/analysis

Development/debugging/analysis/rational.txt

Development/debugging/benchmarking

Development/debugging/benchmarking/caching.txt

Development/debugging/benchmarking/cuda.txt

Development/debugging/benchmarking/matlab.txt

Development/debugging/debuggers.txt

Development/debugging/gdb

Development/debugging/gdb/automation.txt

Development/debugging/gdb/expecting.txt

Development/debugging/gdb/flow.txt

Development/debugging/gdb/gdb.txt

Development/debugging/gdb/remote.txt

Development/debugging/gdb/signals.txt

Development/debugging/gdb/threads.txt

Development/debugging/gdb/ui.txt

Development/debugging/gpu.txt

Development/debugging/kernel

Development/debugging/kernel/using-gdb-for-debugging-kernel-modules.html

Development/debugging/profilers.txt

Development/debugging/profiling

Development/debugging/profiling/gcc

Development/debugging/profiling/gcc/gcov.txt

Development/debugging/profiling/gcc/gprof.txt

Development/debugging/profiling/google.txt

Development/debugging/profiling/hw-counters

Development/debugging/profiling/hw-counters/cache-misses-intel.html

Development/debugging/profiling/memory.txt

Development/debugging/profiling/oprofile.txt

Development/debugging/profiling/outdated

Development/debugging/profiling/outdated/_list_.txt

Development/debugging/profiling/outdated/codeanalyst.txt

Development/debugging/profiling/outdated/perfmon2.txt

Development/debugging/profiling/outdated/perfsuite.txt

Development/debugging/profiling/perf.txt

Development/debugging/profiling/python.txt

Development/debugging/profiling/valgrind.txt

Development/debugging/profiling/vtune.txt

Development/debugging/tracers

Development/debugging/tracers.txt

Development/debugging/tracers/blktrace.txt

Development/debugging/tracers/debugfs.txt

Development/debugging/tracers/dtrace

Development/debugging/tracers/dtrace/dtrace.txt

Development/debugging/tracers/dtrace/linux_zones

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux.html

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux_files

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux_files/feed-12x.gif

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux_files/ga.js

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux_files/metrics_group1.js

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux_files/pacifica-custom.css

Development/debugging/tracers/dtrace/linux_zones/dtrace_for_linux_files/s_code_remote.js

Development/debugging/tracers/systap

Development/debugging/tracers/systap/examples

Development/debugging/tracers/systap/examples/io.stp

Development/debugging/tracers/systap/modules.txt

Development/debugging/validating

Development/debugging/validating/Runtime.errors.pdf

Show diffs side-by-side

added added

removed removed

Analysis/practice/trace-readahead-with-database-io.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head><title>Ftrace: The hidden light switch [LWN.net]</title>

var ados_keywords = ados_keywords || [];

if( location.protocol=='https:' ) {

ados_keywords.push('T:SSL');

} else {

ados_keywords.push('T:HTTP');

}

var ados = ados || {};

ados.run = ados.run || [];

ados.run.push(function() {

ados_add_placement(4669, 20979, "azk13321_leaderboard", 4).setZone(16026);

ados_add_placement(4669, 20979, "azk93271_right_zone", [5,10,6]).setZone(16027);

ados_add_placement(4669, 20979, "azk31017_tracking", 20).setZone(20995);

ados_setKeywords(ados_keywords.join(', '));

ados_load();

});</script>

</head>

<tr>

<a href="/"><img src="/images/lcorner.png" width="153"

height="120" border="0" alt="LWN.net Logo"></a>

</center>

<p>

google_ad_client = "pub-4358676377058562";

google_ad_width = 120;

google_ad_height = 240;

google_ad_format = "120x240_as";

google_ad_type = "text_image";

//2007-10-07: side ads

google_ad_channel = "0946045135";

google_color_border = "ffcc99";

google_color_bg = "ffcc99";

google_color_link = "0000FF";

google_color_text = "000000";

google_color_url = "008000";

//-->

</script>

src="//pagead2.googlesyndication.com/pagead/show_ads.js">

</script>

<p>

<p class="Header">Not logged in</p>

<p><a href="/newaccount">Create an account</a></p>

<p><a href="/subscribe/Info">Subscribe to LWN</a></p></div>

<p class="Header">Weekly Edition</p>

<p><a href="/Articles/607987/">Return to the Kernel page</a></p></div>

<p class="Header">Recent Features</p>

<p><a href="/Articles/654633/">Development statistics for the 4.2 kernel</a></p>

<p><a href="/Articles/653711/">LWN.net Weekly Edition for August 13, 2015</a></p>

<p><a href="/Articles/654128/">Working with xdg-app application bundles</a></p>

<p><a href="/Articles/653180/">LWN.net Weekly Edition for August 6, 2015</a></p>

<p><a href="/Articles/653411/">"Big data" features coming in PostgreSQL 9.5</a></p></div>

<p><a href="/Articles/608497/?format=printable">Printable page</a></p></div>

</td>

<tr>

<td class="NavLink"><a href="/current/">Weekly edition</a></td>

<a href="/Kernel/">Kernel</a></td>

100

<td class="NavLink"><a href="/Security/">Security</a></td>

101

102

<a href="http://lwn.net/Distributions/">Distributions</a></td>

103

<td class="NavLink"><a href="/op/FAQ.lwn#contact">Contact Us</a> </td>

104

<td class="NavLink"><a href="/Search/">Search</a> </td>

105

</tr>

106

107

<tr>

108

<td class="NavLink"><a href="/Archives/">Archives</a></td>

109

<td class="NavLink"><a href="/Calendar/">Calendar</a></td>

110

<td class="NavLink"><a href="/subscribe/Info">Subscribe</a></td>

111

<td class="NavLink"><a href="/op/AuthorGuide.lwn">Write for LWN</a></td>

112

113

<td class="NavLink"><a href="/op/Sponsors.lwn">Sponsors</a></td>

114

</tr>

115

116

</table>

117

</td><td></td></tr>

118

119

120

121

<h1>Ftrace: The hidden light switch</h1>

122

</div>

123

124

125

<p>August 13, 2014</p>

126

<p>This article was contributed by <a href="http://www.brendangregg.com">Brendan Gregg</a></p>

127

</div>

128

<p>You may think, as I did, that analyzing the Linux kernel is like

129

venturing through a dark dungeon: without the addition of advanced tracers

130

like SystemTap, there's much that can't be seen, and can only be

131

inferred.

132

133

However, I've recently found hidden switches that turn on some bright

134

lights, strategically placed by Steven Rostedt and others since the 2.6.27

135

release. These are the ftrace profilers. I haven't even tried all the

136

switches yet, but I'm stunned at what I've seen so far, and I'm having to

137

rethink what I previously believed about Linux kernel performance

138

analysis.</p>

139

140

<p>Recently at Netflix (where I work), a Cassandra database was performing

141

poorly after a

142

system upgrade, and disk I/O inflation (a massive

143

increase in the number of I/O operations submitted) was suspected. There

144

can be many

145

causes for this: a worse cache-hit ratio, record-size inflation, readahead

146

inflation, other applications, even other asynchronous kernel tasks (file

147

system background scrubs). The question was: which one, and how do we fix

148

it?</p>

149

150

<h4 id="iosnoop">1. iosnoop</h4>

151

<p>I started with basic server health checks, and then my <a

152

href="http://www.brendangregg.com/blog/2014-07-16/iosnoop-for-linux.html"><tt>iosnoop</tt></a>

153

tool.

154

<tt>iosnoop</tt> is a shell script that uses the <tt>/sys/kernel/debug</tt>

155

ftrace

156

facilities, and is in my <a

157

href="https://github.com/brendangregg/perf-tools">perf-tools</a> collection

158

on GitHub, along with other ftrace hacks. These work on Netflix's existing

159

servers (which often run Linux 3.2 with security patches) without any

160

other addition to the system, and without requiring kernel debuginfo.

161

In this case, <tt>iosnoop</tt> was run both with

162

and without <tt>-Q</tt> to see the effect of queuing:</p>

163

164

<pre>

165

# <b>./iosnoop -ts</b>

166

STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms

167

13370264.614265 13370264.614844 java 8248 R 202,32 1431244248 45056 0.58

168

13370264.614269 13370264.614852 java 8248 R 202,32 1431244336 45056 0.58

169

13370264.614271 13370264.614857 java 8248 R 202,32 1431244424 45056 0.59

170

13370264.614273 13370264.614868 java 8248 R 202,32 1431244512 45056 0.59

171

[...]

172

# <b>./iosnoop -Qts</b>

173

STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms

174

13370410.927331 13370410.931182 java 8248 R 202,32 1596381840 45056 3.85

175

13370410.927332 13370410.931200 java 8248 R 202,32 1596381928 45056 3.87

176

13370410.927332 13370410.931215 java 8248 R 202,32 1596382016 45056 3.88

177

13370410.927332 13370410.931226 java 8248 R 202,32 1596382104 45056 3.89

178

[...]

179

</pre>

180

181

<p>I didn't see anything out of the ordinary: a higher disk I/O load was

182

causing higher queue times.</p>

183

184

185

<p>The tools that follow are all from the same collection: they demonstrate

186

existing capabilities of ftrace, and how they are useful for solving real

187

problems.</p>

188

189

<h4 id="tpoint">2. tpoint</h4>

190

191

<p>To investigate these disk reads in more detail, I used <tt>tpoint</tt> to

192

trace the <tt>block:block_rq_insert</tt> tracepoint:</p>

193

<p>

194

<pre>

195

# <b>./tpoint -H block:block_rq_insert</b>

196

Tracing block:block_rq_insert. Ctrl-C to end.

197

# tracer: nop

198

199

# TASK-PID CPU# TIMESTAMP FUNCTION

200

# | | | | |

201

java-16035 [000] 13371565.253582: block_rq_insert: 202,16 WS 0 () 550505336 + 88 [java]

202

java-16035 [000] 13371565.253582: block_rq_insert: 202,16 WS 0 () 550505424 + 56 [java]

203

java-8248 [007] 13371565.278372: block_rq_insert: 202,32 R 0 () 660621368 + 88 [java]

204

java-8248 [007] 13371565.278373: block_rq_insert: 202,32 R 0 () 660621456 + 88 [java]

205

java-8248 [007] 13371565.278374: block_rq_insert: 202,32 R 0 () 660621544 + 24 [java]

206

java-8249 [007] 13371565.311507: block_rq_insert: 202,32 R 0 () 660666416 + 88 [java]

207

[...]

208

</pre>

209

210

<p>I was checking for anything obviously odd, but the I/O details looked

211

normal. The <tt>-H</tt> option prints column headers.</p>

212

213

<p>Next, I traced the code path that led to this I/O by printing stack

214

traces (<tt>-s</tt>), to see if they contained an explanation. I also added an

215

in-kernel filter to match reads only (when the "<tt>rwbs</tt>"

216

flag field contains "<tt>R</tt>"):</p>

217

218

<pre>

219

# <b>./tpoint -s block:block_rq_insert 'rwbs ~ "*R*"' | head -1000</b>

220

Tracing block:block_rq_insert. Ctrl-C to end.

221

java-8248 [005] 13370789.973826: block_rq_insert: 202,16 R 0 () 1431480000 + 8 [java]

222

java-8248 [005] 13370789.973831: <stack trace>

223

=> blk_flush_plug_list

224

=> blk_queue_bio

225

=> generic_make_request.part.50

226

=> generic_make_request

227

=> submit_bio

228

=> do_mpage_readpage

229

=> mpage_readpages

230

=> xfs_vm_readpages

231

=> read_pages

232

=> __do_page_cache_readahead

233

=> ra_submit

234

=> do_sync_mmap_readahead.isra.24

235

=> filemap_fault

236

=> __do_fault

237

=> handle_pte_fault

238

=> handle_mm_fault

239

=> do_page_fault

240

=> page_fault

241

java-8248 [005] 13370789.973831: block_rq_insert: 202,16 R 0 () 1431480024 + 32 [java]

242

java-8248 [005] 13370789.973836: <stack trace>

243

=> blk_flush_plug_list

244

=> blk_queue_bio

245

=> generic_make_request.part.50

246

[...]

247

</pre>

248

249

<p>Great! The output is similar to the previous example, but with stack

250

traces beneath each tracepoint event. I limited the output using <tt>head</tt> as

251

it is verbose.</p>

252

253

<p><tt>tpoint</tt> is another ftrace-based tool. It's usually better to use

254

perf events (the

255

<tt>perf</tt> command) for this

256

particular use case, as it can handle multi-user access to performance data

257

and a higher event

258

rate, although it is more time-consuming to use. I just wanted to quickly

259

eyeball a few dozen stack traces for a given tracepoint.</p>

260

261

<p>The stacks were mostly the same as the example above, which provided a

262

couple of leads: page faults and readahead. This Ubuntu system was using 2MB

263

direct-mapped pages, instead of 4KB like the old system. It also

264

had readahead set to 2048KB, instead of 128KB. Either of these

265

differences could be causing the inflation, although tuning readahead had

266

already been tested, and found to make no difference.</p>

267

268

<h4 id="funccount">3. funccount</h4>

269

270

<p>I wanted to understand that stack trace better, so I started by counting

271

calls using <tt>funccount</tt>, which uses ftrace function profiling. Starting

272

with the per-second rate of <tt>submit_bio()</tt>:</p>

273

274

<pre>

275

# <b>./funccount -i 1 submit_bio</b>

276

Tracing "submit_bio"... Ctrl-C to end.

277

278

FUNC COUNT

279

submit_bio 27881

280

281

FUNC COUNT

282

submit_bio 28478

283

[...]

284

</pre>

285

286

<p>This rate, about 28,000 calls per second, is on par with what the disks

287

are doing

288

as seen from <tt>iostat</tt>. <tt>funccount</tt> is counting events in the

289

kernel for efficiency.</p>

290

291

<p>Now checking the rate of <tt>filemap_fault()</tt>, which is closer in

292

the stack to the database:</p>

293

294

<pre>

295

# <b>./funccount -i 1 filemap_fault</b>

296

Tracing "filemap_fault"... Ctrl-C to end.

297

298

FUNC COUNT

299

filemap_fault 2203

300

301

FUNC COUNT

302

filemap_fault 3227

303

[...]

304

</pre>

305

306

<p>This is consistent with what we believed the application was requesting

307

from the filesystem. There is about a 10x inflation between these calls

308

and the issued disk I/O (as evidenced by the <tt>submit_bio()</tt> calls).</p>

309

310

<h4 id="funcslower">4. funcslower</h4>

311

312

<p>Just to confirm that the database is suffering latency caused by the stack

313

trace I was studying, I used <tt>funcslower</tt> (another ftrace-based

314

tool, which uses in-kernel timing and filtering for efficiency)

315

to measure <tt>filemap_fault()</tt> calls

316

taking longer than 1000 microseconds (1ms):</p>

317

318

<pre>

319

# <b>./funcslower -P filemap_fault 1000</b>

320

Tracing "filemap_fault" slower than 1000 us... Ctrl-C to end.

321

0) java-8210 | ! 5133.499 us | } /* filemap_fault */

322

0) java-8258 | ! 1120.600 us | } /* filemap_fault */

323

0) java-8235 | ! 6526.470 us | } /* filemap_fault */

324

2) java-8245 | ! 1458.30 us | } /* filemap_fault */

325

[...]

326

</pre>

327

328

<p>These latencies look similar to those seen from disk I/O (with queue

329

time). I'm in the right area.</p>

330

331

<h4 id="funccount-again">5. funccount (again)</h4>

332

333

<p>I noticed that the stack has "readpage" calls and then

334

"readpage<b>s</b>". Tracing them both at the same time:</p>

335

336

<pre>

337

# <b>./funccount -i 1 '*mpage_readpage*'</b>

338

Tracing "*mpage_readpage*"... Ctrl-C to end.

339

340

FUNC COUNT

341

mpage_readpages 364

342

do_mpage_readpage 122930

343

344

FUNC COUNT

345

mpage_readpages 318

346

do_mpage_readpage 110344

347

[...]

348

</pre>

349

350

<p>Here's our inflation: <tt>mpage_readpages()</tt> is being called about

351

300 times per second, and then <tt>do_mpage_readpage()</tt> over 100k times

352

per second. This still looks like readahead, although we did try to adjust

353

readahead sizes as an

354

experiment, and it didn't make a difference.</p>

355

356

<h4 id="kprobe">6. kprobe</h4>

357

358

<p>Maybe our readahead tuning didn't take effect? I can check this using

359

dynamic tracing of kernel functions.

360

361

Starting with the above stack trace, I saw that

362

<tt>__do_page_cache_readahead()</tt>

363

has <tt>nr_to_read</tt> (number of pages to read) as an argument, which

364

comes from

365

the readahead setting. Using <tt>kprobe</tt>, an ftrace- and kprobes-based tool,

366

to dynamically trace this function and argument:</p>

367

368

<pre>

369

# <b>./kprobe -H 'p:do __do_page_cache_readahead nr_to_read=%cx'</b>

370

Tracing kprobe m. Ctrl-C to end.

371

# tracer: nop

372

373

# TASK-PID CPU# TIMESTAMP FUNCTION

374

# | | | | |

375

java-8714 [000] 13445354.703793: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200

376

java-8716 [002] 13445354.819645: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200

377

java-8734 [001] 13445354.820965: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200

378

java-8709 [000] 13445354.825280: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200

379

[...]

380

</pre>

381

382

<p>I used <tt>-H</tt> to print the header, and <tt>p:</tt> to specify that we

383

will create a probe on function entry, which we'll call "do"

384

(that alias is optional). The rest of that line specifies the function and

385

optional arguments.

386

387

Without kernel debuginfo, I can't refer to the <tt>nr_to_read</tt>

388

symbol, so I need to use registers instead. I guessed %cx: If true, then

389

our tuning hasn't taken hold, as 200 in hex is 512 pages: the original 2048KB.</p>

390

391

<h4 id="funcgraph">7. funcgraph</h4>

392

393

<p>To be sure, I read the code to see how this value is passed around, and

394

used <tt>funcgraph</tt> to illustrate it:</p>

395

396

<pre>

397

# <b>./funcgraph -P filemap_fault | head -1000</b>

398

2) java-8248 | | filemap_fault() {

399

2) java-8248 | 0.568 us | find_get_page();

400

2) java-8248 | | do_sync_mmap_readahead.isra.24() {

401

2) java-8248 | 0.160 us | max_sane_readahead();

402

2) java-8248 | | ra_submit() {

403

2) java-8248 | | __do_page_cache_readahead() {

404

2) java-8248 | | __page_cache_alloc() {

405

2) java-8248 | | alloc_pages_current() {

406

2) java-8248 | 0.228 us | interleave_nodes();

407

2) java-8248 | | alloc_page_interleave() {

408

2) java-8248 | | __alloc_pages_nodemask() {

409

2) java-8248 | 0.105 us | next_zones_zonelist();

410

2) java-8248 | | get_page_from_freelist() {

411

2) java-8248 | 0.093 us | next_zones_zonelist();

412

2) java-8248 | 0.101 us | zone_watermark_ok();

413

2) java-8248 | | zone_statistics() {

414

2) java-8248 | 0.073 us | __inc_zone_state();

415

2) java-8248 | 0.074 us | __inc_zone_state();

416

2) java-8248 | 1.209 us | }

417

2) java-8248 | 0.142 us | prep_new_page();

418

2) java-8248 | 3.582 us | }

419

2) java-8248 | 4.810 us | }

420

2) java-8248 | 0.094 us | inc_zone_page_state();

421

[...]

422

</pre>

423

424

<p><tt>funcgraph</tt> uses another ftrace feature: the function graph

425

profiler. It has moderate overhead, since it traces all kernel functions,

426

so I only use it for exploratory purposes like this one-off. The output

427

shows the code-flow in the kernel, and even has time deltas in

428

microseconds. It's the call to <tt>max_sane_readahead()</tt> that is

429

interesting, as that fetches the readahead value it wants to use.</p>

430

431

<h4 id="kprobe-again">8. kprobe (again)</h4>

432

433

<p>This time I'll trace the return of the <tt>max_sane_readahead()</tt>

434

function:</p>

435

<pre>

436

# <b>./kprobe 'r:m max_sane_readahead $retval'</b>

437

Tracing kprobe m. Ctrl-C to end.

438

java-8700 [000] 13445377.393895: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \

439

max_sane_readahead) arg1=200

440

java-8723 [003] 13445377.396362: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \

441

max_sane_readahead) arg1=200

442

java-8701 [001] 13445377.398216: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \

443

max_sane_readahead) arg1=200

444

java-8738 [000] 13445377.399793: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \

445

max_sane_readahead) arg1=200

446

java-8728 [000] 13445377.408529: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \

447

max_sane_readahead) arg1=200

448

[...]

449

</pre>

450

451

<p>This is also 0x200 pages: 2048KB, and this time I used the <tt>$retval</tt>

452

alias instead of guessing registers. So the tuning really did not take

453

effect.

454

Studying the kernel source, I saw that the readahead property was set by

455

a function called <tt>file_ra_state_init()</tt>. Under what circumstances is

456

<em>that</em> called, and how do I trigger it? ftrace/kprobes to the rescue

457

again:</p>

458

459

<pre>

460

# <b>./kprobe -s p:file_ra_state_init</b>

461

Tracing kprobe m. Ctrl-C to end.

462

kprobe-20331 [002] 13454836.914913: file_ra_state_init: (file_ra_state_init+0x0/0x30)

463

kprobe-20331 [002] 13454836.914918: <stack trace>

464

=> vfs_open

465

=> nameidata_to_filp

466

=> do_last

467

=> path_openat

468

=> do_filp_open

469

=> do_sys_open

470

=> sys_open

471

=> system_call_fastpath

472

kprobe-20332 [007] 13454836.915191: file_ra_state_init: (file_ra_state_init+0x0/0x30)

473

kprobe-20332 [007] 13454836.915194: <stack trace>

474

=> vfs_open

475

=> nameidata_to_filp

476

[...]

477

</pre>

478

479

<p>This time I used <tt>-s</tt> to print stack traces, which showed that

480

this often

481

happens from the <tt>open()</tt> syscall. As I'd left Cassandra running

482

while tuning

483

readahead, it may not have reopened its files and run

484

<tt>file_ra_stat_init()</tt>.

485

486

So I restarted Cassandra to see if the readahead tuning would then take

487

effect, and re-measured:</p>

488

489

<pre>

490

# <b>./kprobe 'r:m max_sane_readahead $retval'</b>

491

Tracing kprobe m. Ctrl-C to end.

492

java-11918 [007] 13445663.126999: m: (ondemand_readahead+0x3b/0x230 <- \

493

max_sane_readahead) arg1=80

494

java-11918 [007] 13445663.128329: m: (ondemand_readahead+0x3b/0x230 <- \

495

max_sane_readahead) arg1=80

496

java-11918 [007] 13445663.129795: m: (ondemand_readahead+0x3b/0x230 <- \

497

max_sane_readahead) arg1=80

498

java-11918 [007] 13445663.131164: m: (ondemand_readahead+0x3b/0x230 <- \

499

max_sane_readahead) arg1=80

500

[...]

501

</pre>

502

503

<p>Success!</p>

504

505

<p><tt>iostat</tt> showed a large drop in disk I/O, and the database latency

506

measurements were much better. This was simply a readahead change, where

507

the new Ubuntu instances defaulted to 2048KB. What had misled us

508

earlier was that tuning it had not appeared to make a difference, as the

509

setting wasn't taking effect.</p>

510

511

<p>Another tunable we checked was the I/O scheduler, and whether changing

512

it from "deadline" to "noop" was immediate. Using:

513

<p>

514

<pre>

515

./funccount -i 1 'deadline*'

516

./funccount -i 1 'noop*'

517

</pre>

518

<p>

519

gave us the answer: they showed the rate of the related

520

kernel functions, and that the tuning was indeed immediate.</p>

521

522

<h4 id="ftrace-and-perf-tools">ftrace and perf-tools</h4>

523

524

<p>All the tools I used here are from my

525

perf-tools

526

collection, which are front-ends to ftrace and related tracers (kprobes,

527

tracepoints, and the function profiler). I've described some of them as

528

hacks, which they are, as they use creative workarounds for the lack of

529

some in-kernel features. For example, <tt>iosnoop</tt> reads both issue and

530

completion events in user space, and calculates the latencies there,

531

instead of doing that more efficiently in kernel context.</p>

532

533

<p>These tools are for older Linux systems without kernel debuginfo, and

534

show what Linux can do using only its built-in ftrace. There's even more to

535

ftrace that I didn't show here: profiling function average latency, tracing

536

wakeup events, function probe triggers, etc. There is also Steven Rostedt's

537

own front-end to ftrace, a multi-tool called <a

538

href="http://git.kernel.org/cgit/linux/kernel/git/rostedt/trace-cmd.git">trace-cmd</a>

539

(<a href="/Articles/410200/">covered on LWN</a> in 2010),

540

which can do more than my collection of smaller tools, and is also much

541

easier to use than operating on the <tt>/sys</tt> files directly.</p>

542

543

<p>To do even more, perf_events (which is also part of the Linux kernel

544

source) with kernel debuginfo lets me examine complex data structures, and

545

even trace kernel code by line number and watch local variables. What I still

546

can't do, however, is perform complex in-kernel aggregations until I add

547

other advanced tracers like <a

548

href="https://sourceware.org/systemtap/">SystemTap</a>, <a

549

href="/Articles/551314/">ktap</a>, or coming up: <a

550

href="/Articles/603983/">eBPF</a>.</p>

551

552

<h4 id="future-work">Future Work</h4>

553

554

<p>Changes are on the way for the Linux kernel, as regular LWN

555

readers likely know. The capabilities

556

I need, to avoid using hacks, may be provided by eBPF, which may be merged

557

<a href="http://lwn.net/Articles/606089/">as soon as Linux 3.18</a>. If

558

that happens, I'll be happy to create new versions of these tools making

559

use of eBPF. They will likely look and feel the same, but their

560

implementation will be much more efficient and reliable. I'll also be able

561

to create many more tools without the fear of maintaining too many

562

hacks.</p>

563

564

<p>Although, even if Linux does bring these capabilities in an upcoming

565

release, it’ll be some time before I can really use them in production in

566

my work environment, depending on how quickly we can or want to be on the

567

latest kernel. So, while my hacks are temporary workarounds, they may be

568

useful for some time to come.</p>

569

570

<h4 id="conclusion">Conclusion</h4>

571

572

<p>We have a large number of Linux cloud instances to analyze at Netflix,

573

and some interesting and advanced performance issues to solve. While we've

574

been looking at using advanced tracers like SystemTap, I've also been

575

studying what's there already, including the ftrace profilers. The

576

scripts I showed in

577

this post use ftrace, so they work on our our existing instances as-is,

578

without even installing kernel debuginfo.</p>

579

580

<p>In this particular example, a Cassandra database experienced a disk I/O

581

inflation issue caused by an increased readahead setting. I traced disk

582

I/O events and latencies, and how these were created in the kernel. This

583

included examining stack traces, counting function-call rates, measuring

584

slow function times, tracing call graphs, and dynamic tracing of function

585

calls and returns, with their arguments and return values.</p>

586

587

<p>I did all of this using ftrace, which has been in the Linux kernel for

588

years. I found the hidden light switches.</p>

589

590

<p>If you are curious about the inner workings of these ftrace tools, see

591

"<a href="http://lwn.net/Articles/370423/">Secrets of the Ftrace function

592

tracer</a>" by Steven Rostedt, "Debugging the kernel using Ftrace" <a

593

href="http://lwn.net/Articles/365835/">part 1</a> and <a

594

href="http://lwn.net/Articles/366796/">part 2</a>, and <a

595

href="http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/ftrace.txt">Documentation/trace/ftrace.txt</a>

596

in the kernel source. The most important lesson here isn't about my tools,

597

but that this level of tracing is even possible on existing Linux

598

kernels. And if eBPF is added, a <em>lot</em> more will be possible.</p><hr width="60%" align="left">

599

(<a href="https://lwn.net/login?target=/Articles/608497/">Log in</a> to post comments)

600

<p>

601

602

</div>

603

604

605

606

607

<p class="CommentTitle">Ftrace: The hidden light switch</p>

608

609

<p class="CommentPoster">Posted Aug 17, 2014 19:45 UTC (Sun) by <b>bernat</b> (subscriber, #51658)

610

[<a href="/Articles/608810/">Link</a>]

611

</p>

612

613

Great article!<br>

614

<p>

615

Unfortunately, unlike Ubuntu kernels, Debian kernels come with most useful tracers disabled. This seems due to this bug:<br>

616

<a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=568025">https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=568025</a><br>

617

<p>

618

I have opened a bug report to ask for them to be enabled again:<br>

619

<a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=758469">https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=758469</a><br>

620

<p>

621

With the fact that Debian does not come with debug symbols for all packages (something that I hope to address in the future), this makes Debian a bit more unfriendly with respect to all those performance tools, compared to Ubuntu.<br>

622

</div>

623

624

</div>

625

<p>

626

627

628

629

</form>

630

</div>

631

632

</div>

633

634

635

636

637

<p class="CommentTitle">Ftrace: The hidden light switch</p>

638

639

<p class="CommentPoster">Posted Aug 25, 2014 9:11 UTC (Mon) by <b>Shugyousha</b> (subscriber, #93672)

640

[<a href="/Articles/609449/">Link</a>]

641

</p>

642

643

A great article indeed.<br>

644

<p>

645

There is a lot of tracing infrastructure in the Kernel and a lot of tools for tracing available. It would be nice if we could get an(other?) article comparing the different tools and their capabilities. Candidates for such a comparison would be SystemTap, ktap, eBPF, the Kernel perf tool and maybe the newer sysdig (<a href="http://www.sysdig.org/">http://www.sysdig.org/</a>) for example.<br>

646

<p>

647

<p>

648

</div>

649

650

</div>

651

<p>

652

653

654

655

</form>

656

</div>

657

658

</div>

659

660

661

662

663

<p class="CommentTitle">Ftrace: The hidden light switch</p>

664

665

<p class="CommentPoster">Posted Sep 12, 2014 6:45 UTC (Fri) by <b>bgregg</b> (subscriber, #46639)

666

[<a href="/Articles/611687/">Link</a>]

667

</p>

668

669

I'd like to write a comparative article at some point, although it is challenging to compare them all, since there's so many factors to take into account. And I'd like to do more research first, such as using SystemTap's non-debuginfo features more (which are practically undocumented, but should be an important factor for real world usage), spend more time with LTTng, try out the ftrace function triggers, try writing a kernel module for tracing based on samples/kprobes/*.c (has no one done this? I can find zero examples other than the kernel source), etc.<br>

670

<p>

671

The best place for tracing system comparisons has, historically, been here on lwn.net, like the ktap vs eBPF article, and others by Jonathan Corbet. So, keep reading... :-)<br>

672

</div>

673

674

</div>

675

<p>

676

677

678

679

</form>

680

</div>

681

682

</div>

683

</div>

684

</div>

685

686

687

688

<p class="CommentTitle">Ftrace: The hidden light switch</p>

689

690

<p class="CommentPoster">Posted Mar 3, 2015 9:24 UTC (Tue) by <b>ituralde</b> (guest, #101279)

691

[<a href="/Articles/635324/">Link</a>]

692

</p>

693

694

Is it possible to use perf-tools (or merely ftrace of perf_events) to profile functions of kernel modules? By modules I mean once that are inserted to a stable kernel.<br>

695

The Q might be silly, I am quite a novice in this matter.<br>

696

</div>

697

698

</div>

699

<p>

700

701

702

703

</form>

704

</div>

705

706

</div>

707

</td>

708

709

710

</td>

711

</tr></table></td>

712

</tr></table>

713

714

715

716

<P>

717

718

719

720

Comments and public postings are copyrighted by their creators.<br>

721

Linux is a registered trademark of Linus Torvalds<br>

722

</font>

723

</center>

724

725

</body></html>

726

b'\\ No newline at end of file'

Older »