Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce GetRangeAndFlatMap to push computations down to FDB #5609

Merged
merged 2 commits into from
Nov 4, 2021

Conversation

nblintao
Copy link
Contributor

@nblintao nblintao commented Sep 16, 2021

It introduces GetRangeAndHop semantic to FDB. With this, SS is able to generate the keys in the queries based on another query. In another word, we can push some computations at the upper layer down to FDB. For example, querying an index and fetch records can be done in a single query! This is expected to improve latency and bandwidth when reading FDB.

See the tests to have a taste how this feature could be used.

This change doesn't affect existing code pass as long as you don't use the new APIs.

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: 47c7f2a
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 47c7f2a
  • Result: FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

state Transaction tr(data->cx);
tr.setVersion(version);
// TODO: is DefaultPromiseEndpoint the best priority for this?
tr.info.taskID = TaskPriority::DefaultPromiseEndpoint;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means the second read go through the read path front door.

Can we craft something like how you get the first read?

GetValueRequest req(Span().context, key, version, Optional<TagSet>(), Optional<UID>());
 			data->actors.add(data->readGuard(req, getValueQ));
 			GetValueReply reply = wait(req.reply.getFuture());

If the second read is also on the same SS, you will save the overhead of getKeyLocation() and the serialization and deserialization on the SS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you meant quickGetValue = the current two-hop solution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first query's code is at getKeyValuesAndHopQ which is effectively same as a normal range scan aka getKeyValuesQ.

hop() issues the second queries from the results returned by the first query.

quickGetValue is the second query.

// TODO: is DefaultPromiseEndpoint the best priority for this?
tr.info.taskID = TaskPriority::DefaultPromiseEndpoint;
Future<Optional<Value>> valueFuture = tr.get(key, Snapshot::True);
// TODO: async in case it needs to read from other servers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reply error if it needs to hop to another server and let client side retry?

Doing like this means the first SS is doing the same amount of work client does if the second-hop key is not colocated with the first-hop key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's another design choice. See the discussion in the design doc.

I'm currently using this one because it has the simplest API and least modification in client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumengpanda I would prefer first SS "do the work". If we can bypass JNI and not have to serde index records, it is a win

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I forgot the JNI overhead when I wrote the comment.

Letting the first SS forward the requests and do the work can avoid JNI and is likely better.
I'd like to point out its tradeoff as well:

  • Suppose the data flow is client->SS1 (to find index) -> SS2 (to find the value) -> SS2 directly reply to client, which means the first SS forwards the request to SS2
  • SS1 will need to cache the shard-to-SS mapping, which causes extra memory. That means we use some extra memory on storage node to save some cpu time on compute node.

We probably don't have to decide the corner case now.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three possibilities, I believe:

  1. first SS does all the proxy calls to wherever
  2. C client does any proxy calls when hopping within SS is not possible
  3. API client does any proxy calls

3 has the JNI overhead, but 2 doesn't. It's more modification, as @nblintao says, but I think it might be the most robust. Particularly once we undertake to do range scans of the hopped-to place.

state Transaction tr(data->cx);
tr.setVersion(version);
// TODO: is DefaultPromiseEndpoint the best priority for this?
tr.info.taskID = TaskPriority::DefaultPromiseEndpoint;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you meant quickGetValue = the current two-hop solution?

GetKeyValuesReply _r =
wait(readRange(data, version, KeyRangeRef(begin, end), req.limit, &remainingLimitBytes, span.context));

std::cout << "read range done, start hopping" << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TraceEvent is better because it can be ingested into our log search sys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I will change to TraceEvent when I finalize the code.

changeCounter,
KeyRangeRef(std::min<KeyRef>(begin, std::min<KeyRef>(req.begin.getKey(), req.end.getKey())),
std::max<KeyRef>(end, std::max<KeyRef>(req.begin.getKey(), req.end.getKey()))));
if (EXPENSIVE_VALIDATION) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making it knob will make your perf & correctness evaluation life on real clusters better

result.arena.dependsOn(hopKey.arena());

std::cout << "quickGetValue start key: " << hopKey.toString() << std::endl;
Optional<Value> valueOption = wait(quickGetValue(data, hopKey, input.version));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can make sure the input.data indexed data is still on the same SS, quickGetValue() will not need the transaction inside it. right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,171 @@
/*
* IndexPrefetchDemo.actor.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means you can run this in multi-tester for perf tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Could you elaborate it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multi-tester is a load generator framework FDB uses to generate load.
Each fdbserver can start as a tester role, which will pick up the specified workloads, like the one you wrote.
These testers will behave as "clients generate traffic to the test cluster" .

Let's sync offline on how to set it up in our internal infra.

@nblintao nblintao changed the title [WIP] Index Prefetch Demo Index Prefetch Demo Sep 17, 2021
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: 33ad45e
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 33ad45e
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 626e517
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@@ -278,11 +283,35 @@ struct GetKeyValuesReply : public LoadBalancedReply {
}
};

// Describe how to hop. Assume the keys are encoded as Tuple, hence there are concept of "elements".
struct HopInfo {
// Do GetKeyValues, extract the last suffixLen elements from each keys.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make this a bit more efficient with modest additional cost in request size by adding a number of bytes to skip before starting to parse a tuple. This would also cover cases where, for some reason, the entire key isn't a well-formed tuple, but the same secondary index pattern is followed after some point.

int suffixLen;
// Form hop keys by appending the fetched suffixes to hopPrefix. Do GetValue using the hop keys.
// TODO: Support getting ranges using the hop keys as prefixes.
KeyRef hopPrefix;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this covers the important cases, but to make the scope explicit.

The simplest case is where the scanned keys are

dir, 'I', value, id1, id2

and the key-values to be fetched are

dir, 'R', id1, id2, n

(n might be a split record counter or an exploded per-field key)

This also handles cases where the scanned keys are

dir, 'I', id1, value, id2

and the key-values to be fetched are still

dir, 'R', id1, id2, n

provided that the scan fixes id1 so that it can be included in the skip prefix and the hop prefix. That is,

WHERE id1 = ? AND value BETWEEN ? AND ? ORDER BY  value

It is not sufficient for cases where id1 needs to be extracted for the hop. Such as,

WHERE id1 > ? ORDER BY id1, value

Nor for the case where the index is

dir, 'I', id2, id1

for scans like

id2 > ? ORDER BY id2

Again, I am not sure more complicated tuple element shuffling descriptors are worth the trouble. The two supported cases match the ones I have seen in real life.

Copy link
Contributor Author

@nblintao nblintao Sep 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can update int suffixLen to a vector of picked item indexes in the expected order.

To make it more expressive (Is there a use case of interleaving the fetched items with the constant items?), and probably more elegant, I'm also thinking about using a single tuple to express both suffixLen and hopPrefix. It could be something like:

dir, "R", "%2", "%4", "%.*"

"%.*" is representing doing a range query rather than point query.
Escaping on the strings are needed, of course.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a use case of interleaving the fetched items with the constant items?

We might distinguish three kinds of hop elements:

  • variable, fetched from the scanned keys
  • constant, occurring in both the scan range and the hop prefix
  • constant, occurring only in the hop prefix

The scanned range might arbitrarily permute its elements for the sake of ordering or inequality comparisons. So the first two interleave arbitrarily, too.

Placing the last someplace other than at the front is maybe a little more contrived, but suppose you have

dir, "I", value, group, id

and

dir, "R", group, "widget", id

where the constant type (widget) is implied for the original scan -- it's an index on that type only -- and so not part of its keys, but the typing is normalized within some form of group.

ACTOR Future<Optional<Value>> quickGetValue(StorageServer* data, StringRef key, Version version) {
if (data->shards[key]->isReadable()) {
try {
// TODO: Use a lower level API may be better? Or tweak priorities?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this needs to be a range request, to get all the key-value pairs starting with the hop prefix.

// TODO: is DefaultPromiseEndpoint the best priority for this?
tr.info.taskID = TaskPriority::DefaultPromiseEndpoint;
Future<Optional<Value>> valueFuture = tr.get(key, Snapshot::True);
// TODO: async in case it needs to read from other servers.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three possibilities, I believe:

  1. first SS does all the proxy calls to wherever
  2. C client does any proxy calls when hopping within SS is not possible
  3. API client does any proxy calls

3 has the JNI overhead, but 2 doesn't. It's more modification, as @nblintao says, but I think it might be the most robust. Particularly once we undertake to do range scans of the hopped-to place.

@nblintao nblintao force-pushed the index-prefetch-demo branch from 626e517 to aec1782 Compare October 7, 2021 05:13
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: aec1782
  • Result: FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: aec1782
  • Result: FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Logs (available for 30 days)

@nblintao nblintao force-pushed the index-prefetch-demo branch from aec1782 to 3ef815c Compare October 11, 2021 18:20
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 3ef815c
  • Result: FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Logs (available for 30 days)

@nblintao nblintao force-pushed the index-prefetch-demo branch from 3ef815c to d9a6aa3 Compare October 12, 2021 01:23
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pull-request-build-macos
  • Commit ID: d9a6aa3
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: d9a6aa3
  • Result: FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

@nblintao nblintao force-pushed the index-prefetch-demo branch from d9a6aa3 to 99f99a0 Compare October 21, 2021 01:06
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 99f99a0
  • Result: FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pr-macos
  • Commit ID: 99f99a0
  • Result: FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Logs (available for 30 days)

@nblintao nblintao force-pushed the index-prefetch-demo branch from 99f99a0 to bed03c6 Compare October 23, 2021 00:45
@nblintao nblintao changed the title Index Prefetch Demo Add GetRangeAndHop to support index prefetch Oct 23, 2021
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pr-macos
  • Commit ID: bed03c6
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: bed03c6
  • Result: FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

@nblintao nblintao changed the title Add GetRangeAndHop to support index prefetch Introduce GetRangeAndHop to support index prefetch optimization Oct 26, 2021
@nblintao nblintao force-pushed the index-prefetch-demo branch from bed03c6 to 36091b9 Compare October 26, 2021 17:09
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 36091b9
  • Result: FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: cc57303
  • Result: FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)

@nblintao nblintao requested a review from xumengpanda October 29, 2021 23:50
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: fbb116f
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@@ -2880,9 +2870,8 @@ ACTOR Future<Void> getKeyValuesAndHopQ(StorageServer* data, GetKeyValuesAndHopRe
GetKeyValuesReply _r = wait(
readRange(data, version, KeyRangeRef(begin, end), req.limit, &remainingLimitBytes, span.context, type));

// std::cout << "read range done, start hopping" << std::endl;
// Hop!!!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel directly changing the std::cout to comment is more reader-friendly than "Hop":
read range done, start hopping

++data->counters.quickGetKeyValuesHit;

// Convert GetKeyValuesReply to RangeResult.
return RangeResult(RangeResultRef(reply.data, reply.more), reply.arena);
} catch (Error& e) {
// std::cout << "quickGetValue fallback because of exception or not managed by the shard " << e.name()
//<< std::endl;
// Fallback.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: your cout message body is a better comment line.

++data->counters.quickGetValueHit;
return reply.value;
} catch (Error& e) {
// std::cout << "quickGetValue fallback because of exception " << e.name() << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: your cout message body is a better comment line.

Copy link
Contributor

@halfprice halfprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very complex change. Still don't fully get the implementation detail.

// than the prefix of the key. So we don't use key() or create_data() in this test.
std::map<std::string, std::string> data;
for (int i = 0; i < 3; i++) {
data[indexEntryKey(i)] = EMPTY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you link index with the actual key? Looks like in this example, as long as the primary key is the suffix of the index key, it will work. It is not necessary to have the index key be in the record of the value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When users actually use it, the indexed field is likely to be stored in the value of the record, and encoded in someway (e.g. Protobuf in Record Layer). But that is not necessary for using this API.

Reverse reverse,
TransactionInfo info,
TagSet tags) {
template <class GetKeyValuesMaybeHopRequest>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client API has getRange and getRangeAndHop. Storage server also has getKeyValues and getKeyValuesAndHop. Why merging the two code path into GetKeyValuesMaybeHop in the middle? Seems unnecessary.

If the main idea is to reuse getRange implementation for both getRange and getRangeAndHop, is there a way we can tell whether a getRange is which one inside the getRange function?

Copy link
Contributor Author

@nblintao nblintao Nov 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good obervation. Ideally, we should "reuse getRange implementation for both getRange and getRangeAndHop" to make the code clean and avoid duplicated code. But it turns out very hard (if not impossible) to make templates work with Flow compiler. This is one of a few places that we are able to merge.

I guess it makes sense to duplicated here as well, to keep things consistent and avoid confusions.

@@ -289,6 +289,21 @@ class Transaction : NonCopyable {
reverse);
}

[[nodiscard]] Future<RangeResult> getRangeAndHop(const KeySelector& begin,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some documentation how to use this API? Especially how to set hopInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there should be a document. But I'm not going to add it for now because the interface is going to be changed very soon in next PR.

@@ -296,6 +301,9 @@ struct GetKeyValuesRequest : TimedRequest {
SpanID spanContext;
Arena arena;
KeySelectorRef begin, end;
// This is a dummy field there has never been used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this, I think maybe we should consider sharing the implementation of getRange with different interface rather than an unified interface (the ...MaybeHop...).

referenceTuple = &valueTuple.get();
} else {
ASSERT(false);
throw internal_error();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we throw an error with more info here? So that we can quickly identify where is the internal_error() thrown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code will be removed in next PR, so I wish to no change it for now. nblintao@2b627b9

Tuple hopInfoTuple = Tuple()
.append("normal"_sr)
.append("{{escaped}}"_sr)
.append("{K[2]}"_sr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This K[], V[], is a little hard to read, still don't know how is the hopInfo is constucted and used.

@jzhou77
Copy link
Contributor

jzhou77 commented Nov 1, 2021

Some comments coming up during the discussion:

  • Documentation about this feature
  • Try to avoid slow tasks or too much work

@sfc-gh-satherton
Copy link
Collaborator

I think the ideal implementation of this feature would be that the storage server only returns to the client the results for secondary lookups it was able to resolve locally. Then, the FDB client can resolve the rest of the lookups with parallelism before the C API returns.

Internally, the return type to this request to the StorageServer interface could be logically equivalent to

std::vector<std::pair<Key, Optional<Optional<Value>>>>

and then the client can scan the vector and dispatch any !present() keys to resolve their values, which can themselves be non-present once resolve (hence the double Optional).

Copy link
Contributor

@alecgrieser alecgrieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one inline comment about naming, but I have a few questions as well around the semantics of this (which I did try to find in the code itself, but figured it might be easier to just ask):

  1. Regarding continuations: I was originally a bit concerned that if a get-range-and-hop scan was terminated prematurely (e.g., to split a range scan across multiple transactions to avoid the 5 second limit) that it would be difficult
  2. Does the key info language support returning a range of kv-pairs (per source key)? The way that the Record Layer stores records, it will both need to be able to concatenate various values together (from a range read that is currently done per index entry) as well as differentiate which record key is which (in order to differentiate the record version from the protobuf record data). (This request is fairly record layer specific, but you could imagine other layers with slightly different data models still wanting similar control. For example, you could imagine a different layer that stored each field in a record in its own key-value pair, grouped per record, so a query of the form SELECT field_1, field_2 FROM type WHERE field_3 > x might be satisfied by scanning an index on field_3 and then looking up only the values for field_1 and field_2 for each entry, which live on separate keys.)
  3. How does the key info language handle escape characters? For example, if I legitimately had the value {K[4]} in my query string, I wouldn't want that to be accidentally interpreted as looking for value 4 in the K tuple. (If there's a spec for the language, that would probably answer both this question and the preceding.)
  4. What's the behavior of empty records (e.g., as might result from there being a dangling index entry that points to a record that has since been deleted)? In particular, I could see users wanting to sometimes error on those situations or sometimes ignore them, but that implies being able to distinguish an empty value from a missing value.

@@ -244,6 +244,24 @@ DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_transaction_get_range(FDBTransaction
fdb_bool_t reverse);
#endif

DLLEXPORT WARN_UNUSED_RESULT FDBFuture* fdb_transaction_get_range_and_hop(FDBTransaction* tr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I don't want to bike shed too much about names, but this is in the public API, so it would be difficult to change in the future. I'm not sure "hop" really captures what this does. Maybe "resolve" or even "lookup"--in particular, when I hear "hop", I assume it's going to make a network hop, which (ideally) it won't have to when used optimally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for pointing out. Changed to "flat map"after discussion.

@nblintao
Copy link
Contributor Author

nblintao commented Nov 1, 2021

@alecgrieser Good questions.

Regarding continuations: I was originally a bit concerned that if a get-range-and-hop scan was terminated prematurely (e.g., to split a range scan across multiple transactions to avoid the 5 second limit) that it would be difficult

The current version doesn't have well support or tests for continuation. It will fail. Clients should have a fallback using the old way. It is planned to be improved in the future version.

Does the key info language support returning a range of kv-pairs (per source key)?

Yes, it is supported (apparently I simplified this in the presentation). If you add a special element {...} as the last element of hop info, it will do a range query using the generate key as the prefix. /~https://github.com/apple/foundationdb/pull/5609/files#diff-c4dbfae809f757f2061099ee5021ce2cbbf0f4dbc2f0f18480a4856d63d033ceR2618 Record Layer doesn't need it but we can change HopInfo to define a range by begin/end, rather than just prefix, in the future.

How does the key info language handle escape characters? ... (If there's a spec for the language, that would probably answer both this question and the preceding.)

What's the behavior of empty records (e.g., as might result from there being a dangling index entry that points to a record that has since been deleted)? In particular, I could see users wanting to sometimes error on those situations or sometimes ignore them, but that implies being able to distinguish an empty value from a missing value.

In the current version, it doesn't raise any problem. I had a TODO here /~https://github.com/apple/foundationdb/pull/5609/files#diff-c4dbfae809f757f2061099ee5021ce2cbbf0f4dbc2f0f18480a4856d63d033ceR2763. We could consider add this as an option in HopInfo too.

@alecgrieser
Copy link
Contributor

Okay, I see. I think that sounds good to me, with a few more responses below.

The current version doesn't have well support or tests for continuation. It will fail.

What does that look like? Does the user get some kind of error (somehow), or is it more like if the user knows that they won't be able to guarantee that they read everything in one transaction, they have to pick a different way to read the data.

I know this HopInfo thing becomes hackey, so there is a separate PR to encode HopInfo in flatbuffer

A more structured HopInfo sounds like a good improvement. I think my concern with adding it later (rather than sooner) is mainly around backwards-compatibility (in that we wouldn't want to make this a breaking API change when we go from Tuples to FlatBuffer objects...I think) which could also mean that we'd need to have translation logic to go from one to the other, which maybe is messy (or maybe it's actually not that bad).

A lot of this might actually come down to how comfortable we are with essentially releasing this as a "beta" feature that we intend to make backwards incompatible changes to that users have to use "at their own risk".

@nblintao
Copy link
Contributor Author

nblintao commented Nov 1, 2021

@sfc-gh-satherton, thanks for pointing out. @MMcM has also mentioned this as method 2 at #5609 (comment). That's exactly what I'm planning to do in the road map. This can avoid deteriorating SS throughput when fallback is needed (which is also summarized in @jzhou77's comment). In addition to that, I can imagine that users should be able to define how many requests a FDB client is able to dispatch simultaneously during fallback (like "pipeline size" concept in Record Layer), which adds additional complexity.

The current implementation handles fallback at SS because it was easier to implement at the beginning. SS is still protected from deteriorated performance by having a knob to fail rather than fallback when the key is not in the same SS (The that the it can fallback at application level).

Thanks @jzhou77 for summarizing. I wish I was in the discussion to explain:

This is an initial PR of a large feature that is going to be evolved in months if not years. Before I pushed out the "finalized" version of this PR for review, I complied a list of more than a dozen items I wanted. Surprisingly, I see every major items in the comments had been included in my list (So probably we'll have a very good consensus on the ultimate design). For many of them, it would be messy and tiring to change in the future rather than get it right in the first place (for example, a new HopInfo schema @alecgrieser). But I had to choose a set things that can be done before 7.0 release, and safe and usable for beta users.

I'll later post a document about the road map and the ultimate design. I hope to have more discussions in a meeting or offline.

I'm still on the way of getting comfortable with this get-features-in-first-and-improve-later thing, and trying to balance it with my desire of quality and good design. It's not easy.

@nblintao nblintao force-pushed the index-prefetch-demo branch from fbb116f to 3f15686 Compare November 3, 2021 20:21
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 3f15686
  • Result: FAILED
  • Error: Error while executing command: ninja -v -C ${BUILD_DIR} -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Logs (available for 30 days)

@nblintao nblintao force-pushed the index-prefetch-demo branch from 3f15686 to 6c98e35 Compare November 3, 2021 20:32
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for macOS Catalina 10.15

  • CodeBuild project: foundationdb-pr-macos
  • Commit ID: 6c98e35
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: foundationdb-pr
  • Commit ID: 6c98e35
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

@nblintao nblintao changed the title Introduce GetRangeAndHop to support index prefetch optimization Introduce GetRangeAndFlatMap to push computations down to FDB Nov 3, 2021
@xumengpanda xumengpanda assigned halfprice and unassigned xumengpanda Nov 4, 2021
Copy link
Contributor

@halfprice halfprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SS logic LGTM. Please make sure you make the API clear before merging to 7.0 (documentation, making contract clear, etc.).

GetRangeLimits limits,
Snapshot snapshot,
Reverse reverse) {
if (getDatabase()->apiVersionAtLeast(630)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 630 true? Not 700?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the 630 copied from getRange. It is 630 because it handles special keys differently before that. Since the new API is added in 700, getDatabase()->apiVersionAtLeast(630) should aways be true. We could simplify the code by removing the else, but I would prefer keep it consistent with getRange so we can deduplicate in the future.

@@ -310,6 +318,43 @@ struct GetKeyValuesRequest : TimedRequest {
}
};

struct GetKeyValuesAndFlatMapReply : public LoadBalancedReply {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it worth noting that if one changes GetKeyValuesAndFlatMapRequest/Reply, it has to change GetKeyValuesRequest/Reply as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Will add a comment in a different PR. And I really wish I can find out a way to make templates work better with Flow.

return Void();
}

ACTOR Future<GetKeyValuesAndFlatMapReply> flatMap(StorageServer* data, GetKeyValuesReply input, StringRef mapper) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure you add documentations to these functions at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants