[Question] Sed and regex string manipulation

[email protected]

Hello! I will take a look at it, I just haven't had a chance over the last day. Give me a couple days and I will give some feedback. Bear in mind I am not an expert, so I might not have much to offer, but I'll share what I can.

[email protected]

This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

[email protected]

I did it!! It also handles the case where an external link and an internal link are on the same line.

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Here is my annotated file

# Begin loop
:l;

# Split the line at the first link: the text up to and including `(' is
# appended to the hold space, the rest stays in the pattern space
# Example: `text [label](file#fragment)'
#   Pattern space: `file#fragment)'
#   Hold space: `text [label]('
# Steps:
#   1. Strategically insert \n
#       1a. If this fails, branch out
#   2. Append to hold space (this creates two \n's. It feels weird for the
#      first iteration, but that's ok)
#   3. Copy hold space to pattern space, remove first \n, then trim off
#      everything past the second \n
#   4. Swap pattern/hold, and trim off everything up to and incl the last \n
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
Te;
H;
g; s/\n//; s/\n.*//;
x; s/.*\n//;

# Modify only if it is an internal link
/^https?:/! {
    # Add hyphens
    :h;
    s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
    th;
    # Make lowercase
    s/(#[^)]*\))/\L\1/;
};

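# Loop back to :l to process the next link. A "conditional" branch (t) is used
# on purpose: a substitution has always succeeded by this point, and taking
# the branch resets that flag so `Te' works correctly on the next pass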
tl;

# Exit: join pattern space to hold space, then move to pattern space.
# Since the loop uses H instead of h, have to make sure hold space is empty
:e;
H;
z;
x; s/\n//;

[email protected]

Wow! Thank you! I ran a quick test on a test-file.md:

[Just a test](#just-a-test)
[Just a link](https://mylink/%20with%20space.com)
[External link](readme.md#just-a-test)
[Link with numbers](readme.md#1-3-this-is-another-test)
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)

Great job! Thank you very much!!! I'm really impressed by what someone with proper knowledge can do! However, I really do not want to mess around with your regex... That would only invite disaster xD! I will carefully keep your regex and annotated file in my knowledge base; I'm sure some time in the future I will come back to it and try to break it down as a learning exercise.

Thank you very much!!!

[email protected]

Hello, sorry to ping you. I just gave pandoc a try, but it doesn't work, and I had to dig a bit further into the web to find out why!

Links to headings with spaces are not specified by CommonMark, and each tool implements a different approach... Most replace spaces with hyphens; others use URL encoding (%20). So even though pandoc looks awesome, it doesn't work for my use case (or did I miss something? Feel free to comment).

You can give it a try on https://pandoc.org/try/ with commonmark to gfm:

[Just a test](#Just a test)
[Just a link](https://mylink/%20with%20space.com)
[External link](Readme.md#JUST%20a%20test)
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test)

If you prefer a CLI version:

pandoc --from=commonmark_x --to=gfm+gfm_auto_identifiers "/home/user/Documents/test.md" -o "pandoc_test.md"
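To make the two anchor conventions mentioned above a bit more concrete, here is a minimal Python sketch. Both helper names (`hyphen_anchor`, `encoded_anchor`) are made up for illustration, and neither reproduces the exact algorithm of pandoc or any other tool; they only show the general difference between hyphenating a heading and URL-encoding it.

```python
from urllib.parse import quote


def hyphen_anchor(heading: str) -> str:
    """Hypothetical helper: lowercase the heading and turn spaces into hyphens."""
    return "#" + heading.strip().lower().replace(" ", "-")


def encoded_anchor(heading: str) -> str:
    """Hypothetical helper: keep the heading as-is but percent-encode it."""
    return "#" + quote(heading.strip())


heading = "Just a test"
print(hyphen_anchor(heading))   # #just-a-test
print(encoded_anchor(heading))  # #Just%20a%20test
```

A link written for one convention will not resolve in a tool that expects the other, which is why the fragments need rewriting in the first place.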

[email protected]

Thank you! It actually ticks every use case (for my files) and looks pretty rad!

"This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed)."

I totally agree, but I will keep your regex as a reference. In the near future I will try to decompose your regex as a learning exercise, but it looks very complex!

Another user came up with the following solution:

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Just as a little experiment, if you want to spend some time and give me an answer: what do you think? It's another way to achieve the same kind of result, but the two solutions are significantly different. I know there are a thousand ways to achieve the same result, but I'm kind of curious how it looks through an expert's eyes :).

Thanks again for your help and the time you took to write up a complex regex for my use case!

[email protected]

Hey, take your time. Don't worry even if you forget; you did more than enough to help some random person on the web! Two other users came up with plain, bare-bones regex solutions, if you want to have a look. Maybe there's something you can learn from them? (I doubt it xD).

Plain sed regex (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Plain Perl regex (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Nonetheless, I really prefer your solution because, as someone else said, I will have an easier time changing a script I "understand". So thanks again!

[email protected]

Hey, I just did a quick web search and found this. I haven't used the tool specifically before. However, I recommend either searching the web for a similar tool or using a ChatGPT-like tool to create a Python script that'll achieve your end result. Sed and regex are cool and useful, but they're only going to make it more difficult to achieve what you need.
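For reference, a Python version could look roughly like the sketch below. It is only an assumption of what such a script might be, modeled on the behavior of the sed command posted in this thread: for links that do not start with http(s), it replaces %20 and dots in the fragment (the part after #) with hyphens and lowercases the fragment. The function name `normalize_fragments` is made up, and the link is assumed to sit on a single line.

```python
import re

# Matches [label](target#fragment). Assumes the whole link is on one line.
LINK_RE = re.compile(r"(\[[^\]]*\]\()([^)#]*)#([^)]*)\)")


def normalize_fragments(line: str) -> str:
    """Hypothetical helper: hyphenate and lowercase fragments of internal links."""

    def fix(match: re.Match) -> str:
        prefix, target, fragment = match.groups()
        # Leave external links (http:// or https://) untouched.
        if re.match(r"https?:", target):
            return match.group(0)
        fragment = fragment.replace("%20", "-").replace(".", "-").lower()
        return f"{prefix}{target}#{fragment})"

    return LINK_RE.sub(fix, line)


print(normalize_fragments("[Link with numbers](readme.md#1.3%20this%20is%20another%20test)"))
# [Link with numbers](readme.md#1-3-this-is-another-test)
```

Only the fragment is rewritten; the file part before the # keeps its case, its %20 and its dots. To process a file, the function could be applied to each line read from it.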

[email protected]

No problem. I think this is a great "final boss" question for learning sed, because it turns out to be deceptively hard!! You have to understand a lot not only about regex, but also about sed to get it right. I learned a lot about sed just by tackling this problem!

"I really do not want to mess around with your regex"

It is very delicate for sure, but one part you can safely change is the # Add hyphens part. In the regex you can see (%20|\.). This is a list of "characters" which get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!

Still, it is not perfect:

If the link spans multiple lines, the regex won't match.
If the link contains escaped characters like \\\\\[LINK](#LINK) or [LINK\]\\\\](#LINK), the regex may not parse it correctly.
If the link is inside a code block ```, it will get changed (which may or may not be intended).

But for a sed-only solution this is about as good as it will get I'm afraid.

Overall I'm very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.
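On the code-block limitation listed above: outside of sed, skipping fenced blocks is fairly easy to sketch. The Python below is a simplified illustration, not a full Markdown parser. It assumes fences are lines starting with ``` (it ignores ~~~ fences and indented code blocks), and the name `apply_outside_fences` is made up. Any per-line rewrite can be passed in as `transform`.

```python
from typing import Callable, Iterable, Iterator

FENCE = "`" * 3  # the three-backtick fence marker


def apply_outside_fences(
    lines: Iterable[str], transform: Callable[[str], str]
) -> Iterator[str]:
    """Hypothetical helper: apply `transform` only to lines outside fenced code."""
    in_fence = False
    for line in lines:
        if line.lstrip().startswith(FENCE):
            # A fence line opens or closes a block and is passed through as-is.
            in_fence = not in_fence
            yield line
        elif in_fence:
            yield line
        else:
            yield transform(line)


doc = ["[a](#A%20B)", FENCE, "[a](#A%20B)", FENCE]
# str.lower stands in for the real link rewrite here.
print(list(apply_outside_fences(doc, str.lower)))
# ['[a](#a%20b)', '```', '[a](#A%20B)', '```']
```

The line inside the fence comes back unchanged, while the identical line outside it is rewritten.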

[email protected]

Well, I'm not going to even try understanding the various features used in that sed command. I do know how to use basic loops with labels, but I never bothered with all the buffer manipulation stuff. I'd rather use awk/perl/python for those cases.

agnos.is Forums
